Cross-Sectional Poisson Analysis of Effort Estimators and Structural Complexity File Metrics

Pre-processing and Exploratory Data Analysis

Load Data

setwd("~/Dropbox/Academia/Hawaii/Carlos_Thesis_Papers/Thesis/Chapters/scripts/data")
derby = read.csv("derby.csv", header = TRUE)
lucene = read.csv("lucene.csv", header = TRUE)
pdfbox = read.csv("pdfbox.csv", header = TRUE)
ivy = read.csv("ivy.csv", header = TRUE)

Derby, Lucene, Pdfbox, Ivy and Ftpserver are all projects of the Apache Software Foundation. Data for four of them (Derby, Lucene, Pdfbox and Ivy) is loaded above, one data frame per project.

The purpose of this analysis is to determine whether a relationship exists between the structural complexity of code files and the effort taken to maintain them. Concretely, we operationalize structural complexity as a set of object-oriented (OO) metrics, each of which measures one aspect of structural complexity. Effort is operationalized through the variables discussion, actions and churn.

Each row is one observation: for every change made to a file to address an issue in a given release, we calculate the file's structural complexity metrics and the associated effort. Each row is therefore identified by the columns file, issue_code and release. How each effort estimator is mapped to a file metric is discussed in the paper.
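As a minimal sketch of the row identity described above, the following checks whether the combination of file, issue_code and release repeats in a data frame. The toy rows are made up for illustration and are not taken from the project data.

```r
# Toy rows standing in for one project's data (all values are invented).
changes <- data.frame(
  file       = c("A.java", "A.java", "B.java", "B.java"),
  issue_code = c("X-1", "X-2", "X-1", "X-1"),
  release    = c("1.0", "1.0", "1.0", "1.0"),
  churn      = c(10, 4, 7, 7),
  stringsAsFactors = FALSE
)

key <- c("file", "issue_code", "release")

# duplicated() flags every row whose key repeats an earlier row's key.
is_repeat <- duplicated(changes[, key])
n_repeats <- sum(is_repeat)

# Show the repeated key combinations, if any.
changes[is_repeat, key]
```

A count of zero would mean that the three columns identify each row uniquely; a nonzero count points at rows worth inspecting before modelling.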

Since this is a cross-sectional study, we must train and test our models at a given point in time. Because we only take measurements per release, we treat the set of data points belonging to each release as a potential training or test set. We must check, however, which releases are large enough to be used, since some lack enough data points.

suppressMessages(library(plyr))
amountDataPointsPerRelease <- function(data) {
    ddply(data, .(release), summarise, n = length(release))
}
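For readers without plyr, roughly the same per-release counts can be sketched in base R. The helper name countPerRelease and the toy data are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical base-R helper: number of rows per release.
countPerRelease <- function(data) {
  counts <- table(data$release)
  data.frame(release = names(counts), n = as.vector(counts),
             stringsAsFactors = FALSE)
}

# Made-up data: three releases with 3, 2 and 1 rows.
toy <- data.frame(release = c("1.0", "1.0", "1.1", "1.0", "1.1", "2.0"))
res <- countPerRelease(toy)
res
```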

The number of data points per release in Derby, Lucene, Pdfbox and Ivy, respectively, is as follows:

amountDataPointsPerRelease(derby)
##     release   n
## 1  10.1.1.0  97
## 2  10.1.2.1 162
## 3  10.1.3.1  57
## 4  10.2.1.6  14
## 5  10.2.2.0  10
## 6  10.3.1.4   4
## 7  10.3.2.1   1
## 8  10.3.3.0   4
## 9  10.4.2.0   7
## 10 10.5.1.1   4
## 11 10.5.3.0 106
## 12 10.6.1.0  84
## 13 10.6.2.1  30
## 14 10.7.1.1  83
amountDataPointsPerRelease(lucene)
##    release  n
## 1    1.9.1  1
## 2      2.2  1
## 3      2.3  3
## 4    2.3.1 20
## 5    2.3.2  7
## 6      2.4  3
## 7      2.9  5
## 8    2.9.1  2
## 9    2.9.2 56
## 10   2.9.3 35
## 11   2.9.4  2
## 12     3.0 17
amountDataPointsPerRelease(pdfbox)
##   release  n
## 1   1.1.0 43
## 2   1.2.1 48
## 3   1.3.1 23
## 4   1.4.0 56
## 5   1.5.0 53
## 6   1.6.0 10
amountDataPointsPerRelease(ivy)
##          release   n
## 1            2.0  14
## 2        2.0-RC1  29
## 3        2.0-RC2  10
## 4  2.0.0-alpha-2  18
## 5   2.0.0-beta-1  25
## 6   2.0.0-beta-2 190
## 7          2.1.0  48
## 8      2.1.0-RC1  22
## 9      2.1.0-RC2  16
## 10         2.2.0  25
## 11     2.2.0-RC1  11

Some releases have too few data points to be used in the analysis. We decided that a release needs at least 40 data points to be considered as either a training or a test set.

filterReleases <- function(data, threshold) {
    # Count the data points per release and keep the releases that have at
    # least `threshold` of them
    data.perRelease = amountDataPointsPerRelease(data)
    releases = data.perRelease[data.perRelease$n >= threshold, "release"]
    # Return only the data points that belong to the retained releases
    data[data$release %in% releases, ]
}

derby = filterReleases(derby, 40)
lucene = filterReleases(lucene, 40)
pdfbox = filterReleases(pdfbox, 40)
ivy = filterReleases(ivy, 40)
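The "at least 40 data points" rule stated in the text can be illustrated on made-up data. This is a base-R sketch with a hypothetical helper, keepLargeReleases, and it includes a release that sits exactly on the threshold.

```r
# Hypothetical base-R variant of the filter: keep rows whose release has
# at least `threshold` observations.
keepLargeReleases <- function(data, threshold) {
  counts <- table(data$release)
  keep <- names(counts)[counts >= threshold]
  data[data$release %in% keep, , drop = FALSE]
}

# Made-up data: releases with 45, 40 and 12 rows.
toy <- data.frame(
  release = rep(c("1.0", "1.1", "2.0"), times = c(45, 40, 12)),
  churn   = seq_len(97)
)

filtered <- keepLargeReleases(toy, 40)
table(as.character(filtered$release))
```

Release 1.1, with exactly 40 rows, is retained under "at least"; a strict comparison would drop it, so the choice of operator matters only for releases on the boundary.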

# Per-issue summary of effort and mean file metrics for Derby (exploratory)
bla = ddply(derby, .(release, issue_code, discussion), summarise, churn_median = median(churn), 
    churn_mean = mean(churn), churn_sd = sd(churn), actions_max = max(actions), 
    mean_raw_loc = mean(raw_loc), mean_ckjm_dit = mean(ckjm_dit), mean_ckjm_ca = mean(ckjm_ca), 
    mean_ckjm_npm = mean(ckjm_npm), mean_ckjm_cbo = mean(ckjm_cbo), mean_ckjm_noc = mean(ckjm_noc), 
    mean_ckjm_rfc = mean(ckjm_rfc), mean_ckjm_lcom = mean(ckjm_lcom), mean_ckjm_wmc = mean(ckjm_wmc), 
    n = length(file))

derby.all = derby[, c(6, 7, 8, 9:17)]
lucene.all = lucene[, c(6, 7, 8, 9:17)]
pdfbox.all = pdfbox[, c(6, 7, 8, 9:17)]
ivy.all = ivy[, c(6, 7, 8, 9:17)]

# derby.means = ddply(derby, .(release,issue_code,discussion), summarise,
# churn_mean = mean(churn), actions_max = max(actions), mean_raw_loc =
# mean(raw_loc),mean_ckjm_dit = mean(ckjm_dit),mean_ckjm_ca =
# mean(ckjm_ca),mean_ckjm_npm = mean(ckjm_npm),mean_ckjm_cbo =
# mean(ckjm_cbo),mean_ckjm_noc = mean(ckjm_noc),mean_ckjm_rfc =
# mean(ckjm_rfc),mean_ckjm_lcom = mean(ckjm_lcom),mean_ckjm_wmc =
# mean(ckjm_wmc))

# Try to average the discussion instead of the files

derby.means = ddply(derby, .(file, release, churn, raw_loc, ckjm_dit, ckjm_ca, 
    ckjm_npm, ckjm_cbo, ckjm_noc, ckjm_rfc, ckjm_lcom, ckjm_wmc), summarise, 
    discussion_mean = mean(discussion), actions_mean = mean(actions))
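The grouping step above averages discussion and actions within each group of rows that share the listed columns. A comparable aggregation can be sketched in base R with aggregate(); the toy data and the reduced set of grouping columns are assumptions made to keep the example small.

```r
# Made-up data: A.java appears twice with the same churn, B.java once.
toy <- data.frame(
  file       = c("A.java", "A.java", "B.java"),
  release    = c("1.0", "1.0", "1.0"),
  churn      = c(10, 10, 3),
  discussion = c(4, 8, 2),
  actions    = c(1, 3, 5)
)

# One row per (file, release, churn) group; the effort columns are averaged.
toy.means <- aggregate(cbind(discussion, actions) ~ file + release + churn,
                       data = toy, FUN = mean)

# Rename the averaged columns to match the naming used in this document.
names(toy.means)[names(toy.means) %in% c("discussion", "actions")] <-
  c("discussion_mean", "actions_mean")
toy.means
```

Note that grouping on churn and on every metric column means rows are merged only when all of those values coincide, so the result may contain almost as many rows as the input.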

cor(derby.means[, c(3:14)], method = "spearman")
##                   churn raw_loc ckjm_dit   ckjm_ca ckjm_npm ckjm_cbo
## churn           1.00000 0.20956  0.10334  0.023104  0.04809  0.01525
## raw_loc         0.20956 1.00000  0.02667  0.313768  0.55569  0.68729
## ckjm_dit        0.10334 0.02667  1.00000  0.161639 -0.01732 -0.22633
## ckjm_ca         0.02310 0.31377  0.16164  1.000000  0.59146  0.26484
## ckjm_npm        0.04809 0.55569 -0.01732  0.591456  1.00000  0.51375
## ckjm_cbo        0.01525 0.68729 -0.22633  0.264843  0.51375  1.00000
## ckjm_noc        0.02167 0.27814 -0.11865  0.304152  0.36782  0.25031
## ckjm_rfc        0.19838 0.94201  0.01222  0.363450  0.64584  0.76921
## ckjm_lcom       0.12127 0.69143  0.14654  0.476083  0.67603  0.50141
## ckjm_wmc        0.18375 0.85035  0.10970  0.539586  0.78530  0.61761
## discussion_mean 0.61275 0.10808  0.03009 -0.002724 -0.01154 -0.01904
## actions_mean    0.43649 0.02382  0.12171  0.019092 -0.10656 -0.04692
##                  ckjm_noc ckjm_rfc ckjm_lcom  ckjm_wmc discussion_mean
## churn            0.021671  0.19838   0.12127  0.183748        0.612755
## raw_loc          0.278143  0.94201   0.69143  0.850348        0.108077
## ckjm_dit        -0.118648  0.01222   0.14654  0.109699        0.030092
## ckjm_ca          0.304152  0.36345   0.47608  0.539586       -0.002724
## ckjm_npm         0.367824  0.64584   0.67603  0.785302       -0.011537
## ckjm_cbo         0.250315  0.76921   0.50141  0.617613       -0.019043
## ckjm_noc         1.000000  0.30474   0.32402  0.380496        0.008888
## ckjm_rfc         0.304740  1.00000   0.74000  0.904341        0.097768
## ckjm_lcom        0.324016  0.74000   1.00000  0.835038        0.074350
## ckjm_wmc         0.380496  0.90434   0.83504  1.000000        0.072536
## discussion_mean  0.008888  0.09777   0.07435  0.072536        1.000000
## actions_mean    -0.037214  0.01259  -0.04319 -0.005744        0.319389
##                 actions_mean
## churn               0.436488
## raw_loc             0.023816
## ckjm_dit            0.121707
## ckjm_ca             0.019092
## ckjm_npm           -0.106563
## ckjm_cbo           -0.046918
## ckjm_noc           -0.037214
## ckjm_rfc            0.012590
## ckjm_lcom          -0.043187
## ckjm_wmc           -0.005744
## discussion_mean     0.319389
## actions_mean        1.000000
cor(derby.all, method = "spearman")
##              churn   actions discussion raw_loc ckjm_dit   ckjm_ca
## churn      1.00000  0.436488   0.612755 0.20956  0.10334  0.023104
## actions    0.43649  1.000000   0.319389 0.02382  0.12171  0.019092
## discussion 0.61275  0.319389   1.000000 0.10808  0.03009 -0.002724
## raw_loc    0.20956  0.023816   0.108077 1.00000  0.02667  0.313768
## ckjm_dit   0.10334  0.121707   0.030092 0.02667  1.00000  0.161639
## ckjm_ca    0.02310  0.019092  -0.002724 0.31377  0.16164  1.000000
## ckjm_npm   0.04809 -0.106563  -0.011537 0.55569 -0.01732  0.591456
## ckjm_cbo   0.01525 -0.046918  -0.019043 0.68729 -0.22633  0.264843
## ckjm_noc   0.02167 -0.037214   0.008888 0.27814 -0.11865  0.304152
## ckjm_rfc   0.19838  0.012590   0.097768 0.94201  0.01222  0.363450
## ckjm_lcom  0.12127 -0.043187   0.074350 0.69143  0.14654  0.476083
## ckjm_wmc   0.18375 -0.005744   0.072536 0.85035  0.10970  0.539586
##            ckjm_npm ckjm_cbo  ckjm_noc ckjm_rfc ckjm_lcom  ckjm_wmc
## churn       0.04809  0.01525  0.021671  0.19838   0.12127  0.183748
## actions    -0.10656 -0.04692 -0.037214  0.01259  -0.04319 -0.005744
## discussion -0.01154 -0.01904  0.008888  0.09777   0.07435  0.072536
## raw_loc     0.55569  0.68729  0.278143  0.94201   0.69143  0.850348
## ckjm_dit   -0.01732 -0.22633 -0.118648  0.01222   0.14654  0.109699
## ckjm_ca     0.59146  0.26484  0.304152  0.36345   0.47608  0.539586
## ckjm_npm    1.00000  0.51375  0.367824  0.64584   0.67603  0.785302
## ckjm_cbo    0.51375  1.00000  0.250315  0.76921   0.50141  0.617613
## ckjm_noc    0.36782  0.25031  1.000000  0.30474   0.32402  0.380496
## ckjm_rfc    0.64584  0.76921  0.304740  1.00000   0.74000  0.904341
## ckjm_lcom   0.67603  0.50141  0.324016  0.74000   1.00000  0.835038
## ckjm_wmc    0.78530  0.61761  0.380496  0.90434   0.83504  1.000000
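The tables above use Spearman's rank correlation, which is the Pearson correlation computed on ranks and therefore responds to monotone rather than strictly linear association. A minimal sketch on simulated data (the variable names mirror the document, but the values and the generating process are invented):

```r
set.seed(1)

# Simulated stand-ins for a size metric and a count-valued effort measure.
raw_loc <- round(rexp(50, rate = 1 / 300))
churn   <- rpois(50, lambda = 5 + raw_loc / 100)

rho_spearman <- cor(raw_loc, churn, method = "spearman")

# The same quantity computed by hand: rank both variables, then use Pearson.
rho_ranks <- cor(rank(raw_loc), rank(churn), method = "pearson")

all.equal(rho_spearman, rho_ranks)
```

Working on ranks makes the coefficient insensitive to the heavy right skew that is typical of count data such as churn.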

# lucene.means = ddply(lucene, .(release,issue_code,discussion),
# summarise, churn_mean = mean(churn), actions_max = max(actions),
# mean_raw_loc = mean(raw_loc),mean_ckjm_dit = mean(ckjm_dit),mean_ckjm_ca
# = mean(ckjm_ca),mean_ckjm_npm = mean(ckjm_npm),mean_ckjm_cbo =
# mean(ckjm_cbo),mean_ckjm_noc = mean(ckjm_noc),mean_ckjm_rfc =
# mean(ckjm_rfc),mean_ckjm_lcom = mean(ckjm_lcom),mean_ckjm_wmc =
# mean(ckjm_wmc))

lucene.means = ddply(lucene, .(file, release, churn, raw_loc, ckjm_dit, ckjm_ca, 
    ckjm_npm, ckjm_cbo, ckjm_noc, ckjm_rfc, ckjm_lcom, ckjm_wmc), summarise, 
    discussion_mean = mean(discussion), actions_mean = mean(actions))

cor(lucene.means[, c(3:14)], method = "spearman")
##                    churn  raw_loc ckjm_dit  ckjm_ca ckjm_npm ckjm_cbo
## churn            1.00000 -0.01779 -0.02746 -0.07796 -0.12384  0.01204
## raw_loc         -0.01779  1.00000  0.65722  0.63864  0.69413  0.85936
## ckjm_dit        -0.02746  0.65722  1.00000  0.79861  0.44885  0.61369
## ckjm_ca         -0.07796  0.63864  0.79861  1.00000  0.43759  0.57931
## ckjm_npm        -0.12384  0.69413  0.44885  0.43759  1.00000  0.55044
## ckjm_cbo         0.01204  0.85936  0.61369  0.57931  0.55044  1.00000
## ckjm_noc         0.02890  0.15666  0.47376  0.56650  0.06137  0.31684
## ckjm_rfc        -0.02734  0.96308  0.68829  0.62440  0.77516  0.89379
## ckjm_lcom       -0.15162  0.80937  0.67958  0.74415  0.62512  0.66545
## ckjm_wmc        -0.04474  0.94364  0.71369  0.70397  0.77584  0.81565
## discussion_mean  0.66407  0.02804 -0.04835 -0.09014 -0.01720  0.16206
## actions_mean     0.33773 -0.23778 -0.21412 -0.09185 -0.20546 -0.13125
##                 ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn            0.02890 -0.02734   -0.1516 -0.04474         0.66407
## raw_loc          0.15666  0.96308    0.8094  0.94364         0.02804
## ckjm_dit         0.47376  0.68829    0.6796  0.71369        -0.04835
## ckjm_ca          0.56650  0.62440    0.7442  0.70397        -0.09014
## ckjm_npm         0.06137  0.77516    0.6251  0.77584        -0.01720
## ckjm_cbo         0.31684  0.89379    0.6655  0.81565         0.16206
## ckjm_noc         1.00000  0.16049    0.1983  0.22700         0.01291
## ckjm_rfc         0.16049  1.00000    0.7778  0.94026         0.06491
## ckjm_lcom        0.19827  0.77780    1.0000  0.84625        -0.12155
## ckjm_wmc         0.22700  0.94026    0.8462  1.00000         0.05218
## discussion_mean  0.01291  0.06491   -0.1216  0.05218         1.00000
## actions_mean    -0.04588 -0.17863   -0.2566 -0.23744         0.10253
##                 actions_mean
## churn                0.33773
## raw_loc             -0.23778
## ckjm_dit            -0.21412
## ckjm_ca             -0.09185
## ckjm_npm            -0.20546
## ckjm_cbo            -0.13125
## ckjm_noc            -0.04588
## ckjm_rfc            -0.17863
## ckjm_lcom           -0.25664
## ckjm_wmc            -0.23744
## discussion_mean      0.10253
## actions_mean         1.00000
cor(lucene.all, method = "spearman")
##               churn  actions discussion  raw_loc ckjm_dit  ckjm_ca
## churn       1.00000  0.33773    0.66407 -0.01779 -0.02746 -0.07796
## actions     0.33773  1.00000    0.10253 -0.23778 -0.21412 -0.09185
## discussion  0.66407  0.10253    1.00000  0.02804 -0.04835 -0.09014
## raw_loc    -0.01779 -0.23778    0.02804  1.00000  0.65722  0.63864
## ckjm_dit   -0.02746 -0.21412   -0.04835  0.65722  1.00000  0.79861
## ckjm_ca    -0.07796 -0.09185   -0.09014  0.63864  0.79861  1.00000
## ckjm_npm   -0.12384 -0.20546   -0.01720  0.69413  0.44885  0.43759
## ckjm_cbo    0.01204 -0.13125    0.16206  0.85936  0.61369  0.57931
## ckjm_noc    0.02890 -0.04588    0.01291  0.15666  0.47376  0.56650
## ckjm_rfc   -0.02734 -0.17863    0.06491  0.96308  0.68829  0.62440
## ckjm_lcom  -0.15162 -0.25664   -0.12155  0.80937  0.67958  0.74415
## ckjm_wmc   -0.04474 -0.23744    0.05218  0.94364  0.71369  0.70397
##            ckjm_npm ckjm_cbo ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc
## churn      -0.12384  0.01204  0.02890 -0.02734   -0.1516 -0.04474
## actions    -0.20546 -0.13125 -0.04588 -0.17863   -0.2566 -0.23744
## discussion -0.01720  0.16206  0.01291  0.06491   -0.1216  0.05218
## raw_loc     0.69413  0.85936  0.15666  0.96308    0.8094  0.94364
## ckjm_dit    0.44885  0.61369  0.47376  0.68829    0.6796  0.71369
## ckjm_ca     0.43759  0.57931  0.56650  0.62440    0.7442  0.70397
## ckjm_npm    1.00000  0.55044  0.06137  0.77516    0.6251  0.77584
## ckjm_cbo    0.55044  1.00000  0.31684  0.89379    0.6655  0.81565
## ckjm_noc    0.06137  0.31684  1.00000  0.16049    0.1983  0.22700
## ckjm_rfc    0.77516  0.89379  0.16049  1.00000    0.7778  0.94026
## ckjm_lcom   0.62512  0.66545  0.19827  0.77780    1.0000  0.84625
## ckjm_wmc    0.77584  0.81565  0.22700  0.94026    0.8462  1.00000

# ivy.means = ddply(ivy, .(release,issue_code,discussion), summarise,
# churn_mean = mean(churn), actions_max = max(actions), mean_raw_loc =
# mean(raw_loc),mean_ckjm_dit = mean(ckjm_dit),mean_ckjm_ca =
# mean(ckjm_ca),mean_ckjm_npm = mean(ckjm_npm),mean_ckjm_cbo =
# mean(ckjm_cbo),mean_ckjm_noc = mean(ckjm_noc),mean_ckjm_rfc =
# mean(ckjm_rfc),mean_ckjm_lcom = mean(ckjm_lcom),mean_ckjm_wmc =
# mean(ckjm_wmc))

ivy.means = ddply(ivy, .(file, release, churn, raw_loc, ckjm_dit, ckjm_ca, ckjm_npm, 
    ckjm_cbo, ckjm_noc, ckjm_rfc, ckjm_lcom, ckjm_wmc), summarise, discussion_mean = mean(discussion), 
    actions_mean = mean(actions))

cor(ivy.means[, c(3:14)], method = "spearman")
##                     churn  raw_loc  ckjm_dit  ckjm_ca ckjm_npm ckjm_cbo
## churn            1.000000 -0.04043  0.006849 -0.08043  0.01725 -0.01943
## raw_loc         -0.040433  1.00000  0.400096  0.31839  0.61566  0.78074
## ckjm_dit         0.006849  0.40010  1.000000  0.57474  0.30273  0.43413
## ckjm_ca         -0.080427  0.31839  0.574738  1.00000  0.47016  0.36316
## ckjm_npm         0.017248  0.61566  0.302729  0.47016  1.00000  0.47624
## ckjm_cbo        -0.019435  0.78074  0.434132  0.36316  0.47624  1.00000
## ckjm_noc         0.020330 -0.04828 -0.071066  0.03143  0.11215 -0.02191
## ckjm_rfc        -0.017455  0.95029  0.450035  0.34946  0.60861  0.88113
## ckjm_lcom        0.011535  0.63499  0.202051  0.46169  0.86990  0.52923
## ckjm_wmc        -0.002518  0.75340  0.382969  0.49444  0.94541  0.63855
## discussion_mean  0.593934 -0.14384 -0.107982 -0.11113 -0.06317 -0.13370
## actions_mean     0.226956 -0.13078 -0.010047  0.01550 -0.03301 -0.09987
##                  ckjm_noc ckjm_rfc ckjm_lcom  ckjm_wmc discussion_mean
## churn            0.020330 -0.01745   0.01153 -0.002518         0.59393
## raw_loc         -0.048281  0.95029   0.63499  0.753398        -0.14384
## ckjm_dit        -0.071066  0.45003   0.20205  0.382969        -0.10798
## ckjm_ca          0.031433  0.34946   0.46169  0.494441        -0.11113
## ckjm_npm         0.112151  0.60861   0.86990  0.945412        -0.06317
## ckjm_cbo        -0.021907  0.88113   0.52923  0.638547        -0.13370
## ckjm_noc         1.000000 -0.03717   0.13117  0.117528         0.08169
## ckjm_rfc        -0.037175  1.00000   0.61923  0.761252        -0.13262
## ckjm_lcom        0.131174  0.61923   1.00000  0.886539        -0.08457
## ckjm_wmc         0.117528  0.76125   0.88654  1.000000        -0.09615
## discussion_mean  0.081694 -0.13262  -0.08457 -0.096149         1.00000
## actions_mean    -0.009676 -0.09706  -0.08366 -0.062500         0.20078
##                 actions_mean
## churn               0.226956
## raw_loc            -0.130782
## ckjm_dit           -0.010047
## ckjm_ca             0.015500
## ckjm_npm           -0.033015
## ckjm_cbo           -0.099873
## ckjm_noc           -0.009676
## ckjm_rfc           -0.097063
## ckjm_lcom          -0.083659
## ckjm_wmc           -0.062500
## discussion_mean     0.200775
## actions_mean        1.000000
cor(ivy.all, method = "spearman")
##                 churn   actions discussion  raw_loc ckjm_dit  ckjm_ca
## churn       1.0000000  0.231230    0.60103 -0.04119  0.01088 -0.07864
## actions     0.2312298  1.000000    0.20458 -0.13002 -0.00880  0.01579
## discussion  0.6010334  0.204577    1.00000 -0.13951 -0.10279 -0.10908
## raw_loc    -0.0411928 -0.130020   -0.13951  1.00000  0.39710  0.31914
## ckjm_dit    0.0108848 -0.008800   -0.10279  0.39710  1.00000  0.57293
## ckjm_ca    -0.0786429  0.015791   -0.10908  0.31914  0.57293  1.00000
## ckjm_npm    0.0197779 -0.032689   -0.06123  0.61141  0.30135  0.47174
## ckjm_cbo   -0.0222131 -0.098882   -0.12914  0.78285  0.42915  0.36078
## ckjm_noc    0.0266050 -0.007428    0.08690 -0.04762 -0.06928  0.03172
## ckjm_rfc   -0.0197180 -0.096311   -0.12841  0.95058  0.44542  0.34798
## ckjm_lcom   0.0103519 -0.084503   -0.08560  0.63123  0.20079  0.46209
## ckjm_wmc    0.0004067 -0.061514   -0.09258  0.75081  0.38064  0.49576
##            ckjm_npm ckjm_cbo  ckjm_noc ckjm_rfc ckjm_lcom   ckjm_wmc
## churn       0.01978 -0.02221  0.026605 -0.01972   0.01035  0.0004067
## actions    -0.03269 -0.09888 -0.007428 -0.09631  -0.08450 -0.0615137
## discussion -0.06123 -0.12914  0.086902 -0.12841  -0.08560 -0.0925847
## raw_loc     0.61141  0.78285 -0.047620  0.95058   0.63123  0.7508100
## ckjm_dit    0.30135  0.42915 -0.069283  0.44542   0.20079  0.3806393
## ckjm_ca     0.47174  0.36078  0.031716  0.34798   0.46209  0.4957634
## ckjm_npm    1.00000  0.46786  0.112784  0.60029   0.87099  0.9454394
## ckjm_cbo    0.46786  1.00000 -0.021794  0.88246   0.52134  0.6326400
## ckjm_noc    0.11278 -0.02179  1.000000 -0.03722   0.12949  0.1187484
## ckjm_rfc    0.60029  0.88246 -0.037223  1.00000   0.61185  0.7550777
## ckjm_lcom   0.87099  0.52134  0.129492  0.61185   1.00000  0.8860091
## ckjm_wmc    0.94544  0.63264  0.118748  0.75508   0.88601  1.0000000


# pdfbox.means = ddply(pdfbox, .(release,issue_code,discussion),
# summarise, churn_mean = mean(churn), actions_max = max(actions),
# mean_raw_loc = mean(raw_loc),mean_ckjm_dit = mean(ckjm_dit),mean_ckjm_ca
# = mean(ckjm_ca),mean_ckjm_npm = mean(ckjm_npm),mean_ckjm_cbo =
# mean(ckjm_cbo),mean_ckjm_noc = mean(ckjm_noc),mean_ckjm_rfc =
# mean(ckjm_rfc),mean_ckjm_lcom = mean(ckjm_lcom),mean_ckjm_wmc =
# mean(ckjm_wmc))

pdfbox.means = ddply(pdfbox, .(file, release, churn, raw_loc, ckjm_dit, ckjm_ca, 
    ckjm_npm, ckjm_cbo, ckjm_noc, ckjm_rfc, ckjm_lcom, ckjm_wmc), summarise, 
    discussion_mean = mean(discussion), actions_mean = mean(actions))

cor(pdfbox.means[, c(3:14)], method = "spearman")
##                    churn raw_loc ckjm_dit  ckjm_ca ckjm_npm ckjm_cbo
## churn            1.00000 0.11951 -0.10151  0.11137   0.1699  0.12277
## raw_loc          0.11951 1.00000  0.35212  0.52607   0.5492  0.63545
## ckjm_dit        -0.10151 0.35212  1.00000  0.12527   0.2075  0.18796
## ckjm_ca          0.11137 0.52607  0.12527  1.00000   0.7093  0.26250
## ckjm_npm         0.16986 0.54918  0.20746  0.70931   1.0000  0.45676
## ckjm_cbo         0.12277 0.63545  0.18796  0.26250   0.4568  1.00000
## ckjm_noc         0.13196 0.17424  0.08788  0.35122   0.4152  0.18975
## ckjm_rfc         0.13774 0.81559  0.35444  0.40835   0.6080  0.85472
## ckjm_lcom        0.09268 0.68738  0.15630  0.53201   0.5550  0.50405
## ckjm_wmc         0.11786 0.74572  0.37223  0.68009   0.9217  0.57000
## discussion_mean  0.44049 0.16298 -0.02520  0.09515   0.1602  0.12136
## actions_mean    -0.08287 0.04366  0.15545 -0.07452  -0.1065 -0.04983
##                 ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn            0.13196  0.13774   0.09268  0.11786         0.44049
## raw_loc          0.17424  0.81559   0.68738  0.74572         0.16298
## ckjm_dit         0.08788  0.35444   0.15630  0.37223        -0.02520
## ckjm_ca          0.35122  0.40835   0.53201  0.68009         0.09515
## ckjm_npm         0.41518  0.60803   0.55502  0.92173         0.16020
## ckjm_cbo         0.18975  0.85472   0.50405  0.57000         0.12136
## ckjm_noc         1.00000  0.27542   0.28975  0.36433        -0.04684
## ckjm_rfc         0.27542  1.00000   0.66605  0.76988         0.17440
## ckjm_lcom        0.28975  0.66605   1.00000  0.69034         0.09781
## ckjm_wmc         0.36433  0.76988   0.69034  1.00000         0.14754
## discussion_mean -0.04684  0.17440   0.09781  0.14754         1.00000
## actions_mean    -0.08394 -0.01554  -0.02925 -0.04775         0.21832
##                 actions_mean
## churn               -0.08287
## raw_loc              0.04366
## ckjm_dit             0.15545
## ckjm_ca             -0.07452
## ckjm_npm            -0.10649
## ckjm_cbo            -0.04983
## ckjm_noc            -0.08394
## ckjm_rfc            -0.01554
## ckjm_lcom           -0.02925
## ckjm_wmc            -0.04775
## discussion_mean      0.21832
## actions_mean         1.00000
cor(pdfbox.all, method = "spearman")
##               churn  actions discussion raw_loc ckjm_dit  ckjm_ca ckjm_npm
## churn       1.00000 -0.08287    0.44049 0.11951 -0.10151  0.11137   0.1699
## actions    -0.08287  1.00000    0.21832 0.04366  0.15545 -0.07452  -0.1065
## discussion  0.44049  0.21832    1.00000 0.16298 -0.02520  0.09515   0.1602
## raw_loc     0.11951  0.04366    0.16298 1.00000  0.35212  0.52607   0.5492
## ckjm_dit   -0.10151  0.15545   -0.02520 0.35212  1.00000  0.12527   0.2075
## ckjm_ca     0.11137 -0.07452    0.09515 0.52607  0.12527  1.00000   0.7093
## ckjm_npm    0.16986 -0.10649    0.16020 0.54918  0.20746  0.70931   1.0000
## ckjm_cbo    0.12277 -0.04983    0.12136 0.63545  0.18796  0.26250   0.4568
## ckjm_noc    0.13196 -0.08394   -0.04684 0.17424  0.08788  0.35122   0.4152
## ckjm_rfc    0.13774 -0.01554    0.17440 0.81559  0.35444  0.40835   0.6080
## ckjm_lcom   0.09268 -0.02925    0.09781 0.68738  0.15630  0.53201   0.5550
## ckjm_wmc    0.11786 -0.04775    0.14754 0.74572  0.37223  0.68009   0.9217
##            ckjm_cbo ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc
## churn       0.12277  0.13196  0.13774   0.09268  0.11786
## actions    -0.04983 -0.08394 -0.01554  -0.02925 -0.04775
## discussion  0.12136 -0.04684  0.17440   0.09781  0.14754
## raw_loc     0.63545  0.17424  0.81559   0.68738  0.74572
## ckjm_dit    0.18796  0.08788  0.35444   0.15630  0.37223
## ckjm_ca     0.26250  0.35122  0.40835   0.53201  0.68009
## ckjm_npm    0.45676  0.41518  0.60803   0.55502  0.92173
## ckjm_cbo    1.00000  0.18975  0.85472   0.50405  0.57000
## ckjm_noc    0.18975  1.00000  0.27542   0.28975  0.36433
## ckjm_rfc    0.85472  0.27542  1.00000   0.66605  0.76988
## ckjm_lcom   0.50405  0.28975  0.66605   1.00000  0.69034
## ckjm_wmc    0.57000  0.36433  0.76988   0.69034  1.00000
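Several metric pairs in the tables above are strongly correlated; for example, the Derby table reports 0.94 between raw_loc and ckjm_rfc. When such metrics are used together as predictors, it can help to list the strongly correlated pairs explicitly. The helper strongPairs, its 0.9 cutoff and the toy data are assumptions for illustration only.

```r
# Hypothetical helper: variable pairs whose |Spearman rho| reaches `cutoff`.
strongPairs <- function(data, cutoff = 0.9) {
  rho <- cor(data, method = "spearman")
  # Look only above the diagonal so that each pair is reported once.
  idx <- which(abs(rho) >= cutoff & upper.tri(rho), arr.ind = TRUE)
  data.frame(var1 = rownames(rho)[idx[, "row"]],
             var2 = colnames(rho)[idx[, "col"]],
             rho  = rho[idx],
             stringsAsFactors = FALSE)
}

# Made-up data: `calls` is a monotone function of `size`; `other` alternates
# in sign and is only weakly related to either.
toy <- data.frame(
  size  = 1:20,
  calls = (1:20)^3,
  other = rep(c(1, -1), 10) * (1:20)
)

pairs <- strongPairs(toy)
pairs
```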

This leaves us with the following releases and associated number of data points for each project:

amountDataPointsPerRelease(derby)
##    release   n
## 1 10.1.1.0  97
## 2 10.1.2.1 162
## 3 10.1.3.1  57
## 4 10.5.3.0 106
## 5 10.6.1.0  84
## 6 10.7.1.1  83
amountDataPointsPerRelease(lucene)
##   release  n
## 1   2.9.2 56
amountDataPointsPerRelease(pdfbox)
##   release  n
## 1   1.1.0 43
## 2   1.2.1 48
## 3   1.4.0 56
## 4   1.5.0 53
amountDataPointsPerRelease(ivy)
##        release   n
## 1 2.0.0-beta-2 190
## 2        2.1.0  48

Since we measured three different effort estimators, we are interested in creating three models, one per effort estimator, each using our chosen structural complexity file metrics.

# Project data for the churn effort estimator models
derby.churn = derby[, c(6, 9:17)]
lucene.churn = lucene[, c(6, 9:17)]
pdfbox.churn = pdfbox[, c(6, 9:17)]
ivy.churn = ivy[, c(6, 9:17)]

# Project data for the actions effort estimator models
derby.actions = derby[, c(7, 9:17)]
lucene.actions = lucene[, c(7, 9:17)]
pdfbox.actions = pdfbox[, c(7, 9:17)]
ivy.actions = ivy[, c(7, 9:17)]

# Project data for the discussion effort estimator models
derby.discussion = derby[, c(8, 9:17)]
lucene.discussion = lucene[, c(8, 9:17)]
pdfbox.discussion = pdfbox[, c(8, 9:17)]
ivy.discussion = ivy[, c(8, 9:17)]
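The column positions used above correspond, according to the correlation tables, to churn (6), actions (7), discussion (8) and the nine metrics from raw_loc to ckjm_wmc (9 to 17). Selecting by name is a sketch of a less position-dependent alternative; the helper effortFrame and the toy data frame are hypothetical.

```r
metrics <- c("raw_loc", "ckjm_dit", "ckjm_ca", "ckjm_npm", "ckjm_cbo",
             "ckjm_noc", "ckjm_rfc", "ckjm_lcom", "ckjm_wmc")

# Hypothetical helper: one effort estimator followed by the nine metrics.
effortFrame <- function(data, estimator) {
  data[, c(estimator, metrics), drop = FALSE]
}

# Made-up two-row data frame with the twelve columns of interest.
toy <- as.data.frame(matrix(seq_len(2 * 12), nrow = 2))
names(toy) <- c("churn", "actions", "discussion", metrics)

toy.churn <- effortFrame(toy, "churn")
names(toy.churn)
```

Selecting by name keeps working if columns are reordered in the CSV files, whereas numeric indices would silently pick different variables.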

For the remaining three subsections the analysis is similar, given the nature of the variables. For each effort estimator, the following hypotheses will be tested:

  1. There is a statistically significant relationship between one or more structural complexity file metrics and the effort estimator within a release.
  2. It is possible to predict the effort estimator from the structural complexity file metrics of upcoming releases of the same project.
  3. It is possible to predict the effort estimator from the structural complexity file metrics of releases of a different project.
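The three hypotheses can be sketched as a fit-then-predict workflow with a Poisson regression, in line with the title of this analysis. Everything below is an assumption for illustration: the data are simulated, the coefficients are invented, only two metrics are used, and mean absolute error is just one possible way to score the predictions.

```r
set.seed(42)

# Simulated stand-in for one release; the coefficients are invented.
simulateRelease <- function(n) {
  raw_loc  <- round(runif(n, 50, 2000))
  ckjm_cbo <- rpois(n, 8)
  mu <- exp(0.5 + 0.0008 * raw_loc + 0.05 * ckjm_cbo)
  data.frame(churn = rpois(n, mu), raw_loc = raw_loc, ckjm_cbo = ckjm_cbo)
}

train <- simulateRelease(100)  # plays the role of the training release
test  <- simulateRelease(60)   # a later release, or a release of another project

# Hypothesis 1: inspect coefficient estimates and p-values within a release.
fit <- glm(churn ~ raw_loc + ckjm_cbo, family = poisson(link = "log"),
           data = train)
coef(summary(fit))

# Hypotheses 2 and 3: predict effort for a release the model has not seen.
pred <- predict(fit, newdata = test, type = "response")
mae  <- mean(abs(test$churn - pred))
mae
```

For hypothesis 2 the test set would be a later release of the same project; for hypothesis 3 it would be a release of a different project, with the fitting code unchanged.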

All-Project Analysis

derby.all = derby[, c(6, 7, 8, 9:17)]
lucene.all = lucene[, c(6, 7, 8, 9:17)]
pdfbox.all = pdfbox[, c(6, 7, 8, 9:17)]
ivy.all = ivy[, c(6, 7, 8, 9:17)]

# If we use means instead of the real values, which implies the curse of
# granularity
derby.all = derby.means[, c(3:14)]
lucene.all = lucene.means[, c(3:14)]
ivy.all = ivy.means[, c(3:14)]
pdfbox.all = pdfbox.means[, c(3:14)]

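# NOTE: the *.all frames above are built from *.means, but the grouping
# factor below comes from the unaggregated frames, so rows and releases can
# be misaligned (hence the warning); *.means$release would line up
derby.all.list = split(derby.all, factor(derby$release))
lucene.all.list = split(lucene.all, factor(lucene$release))
pdfbox.all.list = split(pdfbox.all, factor(pdfbox$release))
ivy.all.list = split(ivy.all, factor(ivy$release))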
## Warning: data length is not a multiple of split variable
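A warning of this kind typically appears when the grouping vector passed to split() has a different length from the data frame being split. The following sketch, on made-up data, splits an aggregated frame by its own release column so that rows and groups stay aligned; the frame toy.means is hypothetical.

```r
# Made-up aggregated frame that still carries its release column.
toy.means <- data.frame(
  release         = c("1.1", "1.0", "1.0", "1.1", "1.0"),
  churn           = c(5, 2, 9, 1, 4),
  discussion_mean = c(3, 1, 6, 2, 2)
)

# Keep only the numeric columns, but group on the release column of the
# SAME frame, so the grouping vector has exactly one entry per row.
toy.list <- split(toy.means[, c("churn", "discussion_mean")],
                  factor(toy.means$release))

sapply(toy.list, nrow)
```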
cor(derby.all.list[[1]], method = "spearman")
##                      churn  raw_loc  ckjm_dit  ckjm_ca ckjm_npm ckjm_cbo
## churn            1.0000000 0.254256  0.282831 -0.11696  0.03694 -0.08520
## raw_loc          0.2542559 1.000000  0.008259  0.35417  0.57017  0.73710
## ckjm_dit         0.2828306 0.008259  1.000000  0.01973 -0.11481 -0.35746
## ckjm_ca         -0.1169627 0.354173  0.019730  1.00000  0.75065  0.44710
## ckjm_npm         0.0369445 0.570169 -0.114814  0.75065  1.00000  0.62068
## ckjm_cbo        -0.0852005 0.737098 -0.357457  0.44710  0.62068  1.00000
## ckjm_noc         0.0006234 0.163896 -0.177063  0.37648  0.37139  0.20376
## ckjm_rfc         0.2199467 0.962672 -0.020208  0.44928  0.66981  0.77360
## ckjm_lcom        0.1530250 0.738295  0.066292  0.54449  0.63755  0.56876
## ckjm_wmc         0.1939540 0.850710  0.022250  0.66296  0.82385  0.67275
## discussion_mean  0.6215001 0.251734  0.138554  0.01328  0.16360 -0.03983
## actions_mean     0.5891304 0.225210  0.304236  0.09660  0.16787 -0.06031
##                   ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn            0.0006234  0.21995   0.15303  0.19395         0.62150
## raw_loc          0.1638955  0.96267   0.73829  0.85071         0.25173
## ckjm_dit        -0.1770632 -0.02021   0.06629  0.02225         0.13855
## ckjm_ca          0.3764767  0.44928   0.54449  0.66296         0.01328
## ckjm_npm         0.3713915  0.66981   0.63755  0.82385         0.16360
## ckjm_cbo         0.2037617  0.77360   0.56876  0.67275        -0.03983
## ckjm_noc         1.0000000  0.20271   0.20823  0.30841         0.01127
## ckjm_rfc         0.2027100  1.00000   0.73811  0.89357         0.22045
## ckjm_lcom        0.2082330  0.73811   1.00000  0.82654         0.16181
## ckjm_wmc         0.3084122  0.89357   0.82654  1.00000         0.22742
## discussion_mean  0.0112738  0.22045   0.16181  0.22742         1.00000
## actions_mean     0.1067537  0.23105   0.18701  0.28342         0.53506
##                 actions_mean
## churn                0.58913
## raw_loc              0.22521
## ckjm_dit             0.30424
## ckjm_ca              0.09660
## ckjm_npm             0.16787
## ckjm_cbo            -0.06031
## ckjm_noc             0.10675
## ckjm_rfc             0.23105
## ckjm_lcom            0.18701
## ckjm_wmc             0.28342
## discussion_mean      0.53506
## actions_mean         1.00000
cor(derby.all.list[[2]], method = "spearman")
##                     churn  raw_loc ckjm_dit ckjm_ca  ckjm_npm ckjm_cbo
## churn            1.000000  0.21475  0.14541 0.01016 -0.004322 -0.09635
## raw_loc          0.214746  1.00000  0.08504 0.18368  0.460489  0.56543
## ckjm_dit         0.145415  0.08504  1.00000 0.18146 -0.049855 -0.23951
## ckjm_ca          0.010163  0.18368  0.18146 1.00000  0.513358  0.09717
## ckjm_npm        -0.004322  0.46049 -0.04986 0.51336  1.000000  0.40405
## ckjm_cbo        -0.096355  0.56543 -0.23951 0.09717  0.404050  1.00000
## ckjm_noc        -0.074927  0.24025 -0.00888 0.31572  0.271653  0.08187
## ckjm_rfc         0.175661  0.94077  0.08577 0.24243  0.562374  0.65095
## ckjm_lcom        0.065729  0.66717  0.18730 0.44068  0.641307  0.41203
## ckjm_wmc         0.140878  0.82697  0.16696 0.46695  0.717843  0.46865
## discussion_mean  0.256012  0.01986 -0.10666 0.03830 -0.099919  0.01869
## actions_mean     0.390805 -0.12537  0.03322 0.05797 -0.205594 -0.22419
##                 ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn           -0.07493  0.17566   0.06573  0.14088         0.25601
## raw_loc          0.24025  0.94077   0.66717  0.82697         0.01986
## ckjm_dit        -0.00888  0.08577   0.18730  0.16696        -0.10666
## ckjm_ca          0.31572  0.24243   0.44068  0.46695         0.03830
## ckjm_npm         0.27165  0.56237   0.64131  0.71784        -0.09992
## ckjm_cbo         0.08187  0.65095   0.41203  0.46865         0.01869
## ckjm_noc         1.00000  0.20995   0.28632  0.32192        -0.11874
## ckjm_rfc         0.20995  1.00000   0.74068  0.87532         0.03043
## ckjm_lcom        0.28632  0.74068   1.00000  0.83400        -0.05556
## ckjm_wmc         0.32192  0.87532   0.83400  1.00000        -0.06952
## discussion_mean -0.11874  0.03043  -0.05556 -0.06952         1.00000
## actions_mean    -0.15830 -0.13092  -0.19137 -0.17575         0.51677
##                 actions_mean
## churn                0.39081
## raw_loc             -0.12537
## ckjm_dit             0.03322
## ckjm_ca              0.05797
## ckjm_npm            -0.20559
## ckjm_cbo            -0.22419
## ckjm_noc            -0.15830
## ckjm_rfc            -0.13092
## ckjm_lcom           -0.19137
## ckjm_wmc            -0.17575
## discussion_mean      0.51677
## actions_mean         1.00000
cor(derby.all.list[[3]], method = "spearman")
##                      churn  raw_loc  ckjm_dit  ckjm_ca ckjm_npm ckjm_cbo
## churn            1.0000000  0.15224  0.001261 -0.06511  -0.2403  0.04648
## raw_loc          0.1522361  1.00000 -0.019743  0.23238   0.5453  0.62586
## ckjm_dit         0.0012607 -0.01974  1.000000  0.15646  -0.0867 -0.26151
## ckjm_ca         -0.0651142  0.23238  0.156465  1.00000   0.6041  0.30042
## ckjm_npm        -0.2403042  0.54528 -0.086696  0.60414   1.0000  0.61431
## ckjm_cbo         0.0464833  0.62586 -0.261508  0.30042   0.6143  1.00000
## ckjm_noc        -0.1492371  0.39826 -0.141732  0.28338   0.5026  0.41552
## ckjm_rfc         0.0787770  0.95294 -0.100620  0.27286   0.6613  0.73537
## ckjm_lcom       -0.0001625  0.74187  0.095026  0.51974   0.7136  0.66605
## ckjm_wmc        -0.0338199  0.82289  0.022201  0.50218   0.8122  0.68525
## discussion_mean  0.7107962 -0.04523 -0.020782 -0.11096  -0.2441  0.02669
## actions_mean     0.5104830 -0.09393  0.159289 -0.26785  -0.2966 -0.01022
##                 ckjm_noc ckjm_rfc  ckjm_lcom ckjm_wmc discussion_mean
## churn           -0.14924  0.07878 -0.0001625 -0.03382         0.71080
## raw_loc          0.39826  0.95294  0.7418728  0.82289        -0.04523
## ckjm_dit        -0.14173 -0.10062  0.0950259  0.02220        -0.02078
## ckjm_ca          0.28338  0.27286  0.5197388  0.50218        -0.11096
## ckjm_npm         0.50258  0.66125  0.7135938  0.81221        -0.24413
## ckjm_cbo         0.41552  0.73537  0.6660491  0.68525         0.02669
## ckjm_noc         1.00000  0.45960  0.5689546  0.51993        -0.04646
## ckjm_rfc         0.45960  1.00000  0.7875099  0.88476        -0.07240
## ckjm_lcom        0.56895  0.78751  1.0000000  0.89737        -0.02056
## ckjm_wmc         0.51993  0.88476  0.8973724  1.00000        -0.15133
## discussion_mean -0.04646 -0.07240 -0.0205551 -0.15133         1.00000
## actions_mean    -0.27255 -0.12622 -0.1495810 -0.21070         0.54021
##                 actions_mean
## churn                0.51048
## raw_loc             -0.09393
## ckjm_dit             0.15929
## ckjm_ca             -0.26785
## ckjm_npm            -0.29663
## ckjm_cbo            -0.01022
## ckjm_noc            -0.27255
## ckjm_rfc            -0.12622
## ckjm_lcom           -0.14958
## ckjm_wmc            -0.21070
## discussion_mean      0.54021
## actions_mean         1.00000
cor(derby.all.list[[4]], method = "spearman")
##                    churn  raw_loc ckjm_dit ckjm_ca ckjm_npm ckjm_cbo
## churn            1.00000  0.37647  0.12690 0.27194  0.19763  0.08956
## raw_loc          0.37647  1.00000  0.01517 0.43832  0.57662  0.73423
## ckjm_dit         0.12690  0.01517  1.00000 0.04713 -0.07259 -0.30255
## ckjm_ca          0.27194  0.43832  0.04713 1.00000  0.60321  0.38881
## ckjm_npm         0.19763  0.57662 -0.07259 0.60321  1.00000  0.57356
## ckjm_cbo         0.08956  0.73423 -0.30255 0.38881  0.57356  1.00000
## ckjm_noc         0.18604  0.24846 -0.13798 0.28871  0.37368  0.27051
## ckjm_rfc         0.34912  0.91205 -0.11171 0.49527  0.68738  0.86335
## ckjm_lcom        0.25535  0.61233  0.10856 0.44214  0.59974  0.49265
## ckjm_wmc         0.41402  0.86444  0.07275 0.58261  0.76481  0.67273
## discussion_mean  0.70954  0.27146  0.16413 0.07099 -0.01229 -0.16431
## actions_mean    -0.08897 -0.15997  0.24027 0.02727 -0.25225 -0.18063
##                  ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn            0.186038   0.3491    0.2554  0.41402         0.70954
## raw_loc          0.248463   0.9120    0.6123  0.86444         0.27146
## ckjm_dit        -0.137975  -0.1117    0.1086  0.07275         0.16413
## ckjm_ca          0.288705   0.4953    0.4421  0.58261         0.07099
## ckjm_npm         0.373676   0.6874    0.5997  0.76481        -0.01229
## ckjm_cbo         0.270509   0.8633    0.4926  0.67273        -0.16431
## ckjm_noc         1.000000   0.3194    0.2050  0.37597         0.14345
## ckjm_rfc         0.319355   1.0000    0.6457  0.91616         0.14180
## ckjm_lcom        0.205034   0.6457    1.0000  0.71794         0.15577
## ckjm_wmc         0.375969   0.9162    0.7179  1.00000         0.24105
## discussion_mean  0.143450   0.1418    0.1558  0.24105         1.00000
## actions_mean    -0.001394  -0.1593   -0.1156 -0.11502        -0.23824
##                 actions_mean
## churn              -0.088970
## raw_loc            -0.159970
## ckjm_dit            0.240272
## ckjm_ca             0.027265
## ckjm_npm           -0.252245
## ckjm_cbo           -0.180635
## ckjm_noc           -0.001394
## ckjm_rfc           -0.159282
## ckjm_lcom          -0.115608
## ckjm_wmc           -0.115018
## discussion_mean    -0.238242
## actions_mean        1.000000
cor(derby.all.list[[5]], method = "spearman")
##                    churn  raw_loc  ckjm_dit ckjm_ca ckjm_npm ckjm_cbo
## churn            1.00000 0.212620  0.241218  0.2141  0.29689  0.05793
## raw_loc          0.21262 1.000000  0.004096  0.4044  0.71226  0.80900
## ckjm_dit         0.24122 0.004096  1.000000  0.2215  0.09085 -0.06357
## ckjm_ca          0.21405 0.404368  0.221522  1.0000  0.58666  0.18083
## ckjm_npm         0.29689 0.712263  0.090848  0.5867  1.00000  0.45730
## ckjm_cbo         0.05793 0.809001 -0.063570  0.1808  0.45730  1.00000
## ckjm_noc        -0.04234 0.313174 -0.156498  0.2034  0.38123  0.34970
## ckjm_rfc         0.22215 0.971008  0.064950  0.3868  0.73499  0.83297
## ckjm_lcom        0.31253 0.596361  0.201572  0.4600  0.79682  0.35982
## ckjm_wmc         0.31717 0.882342  0.151562  0.5348  0.90903  0.65303
## discussion_mean  0.81740 0.136541  0.056474  0.1440  0.26535 -0.02947
## actions_mean     0.43564 0.498693  0.225226  0.3107  0.37994  0.33309
##                 ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn           -0.04234  0.22215    0.3125   0.3172         0.81740
## raw_loc          0.31317  0.97101    0.5964   0.8823         0.13654
## ckjm_dit        -0.15650  0.06495    0.2016   0.1516         0.05647
## ckjm_ca          0.20342  0.38682    0.4600   0.5348         0.14400
## ckjm_npm         0.38123  0.73499    0.7968   0.9090         0.26535
## ckjm_cbo         0.34970  0.83297    0.3598   0.6530        -0.02947
## ckjm_noc         1.00000  0.33967    0.3737   0.3598        -0.03862
## ckjm_rfc         0.33967  1.00000    0.6355   0.9145         0.14717
## ckjm_lcom        0.37370  0.63550    1.0000   0.7943         0.26025
## ckjm_wmc         0.35977  0.91447    0.7943   1.0000         0.24043
## discussion_mean -0.03862  0.14717    0.2603   0.2404         1.00000
## actions_mean     0.02029  0.50569    0.3429   0.5081         0.30711
##                 actions_mean
## churn                0.43564
## raw_loc              0.49869
## ckjm_dit             0.22523
## ckjm_ca              0.31070
## ckjm_npm             0.37994
## ckjm_cbo             0.33309
## ckjm_noc             0.02029
## ckjm_rfc             0.50569
## ckjm_lcom            0.34295
## ckjm_wmc             0.50814
## discussion_mean      0.30711
## actions_mean         1.00000
cor(derby.all.list[[6]], method = "spearman")
##                    churn  raw_loc ckjm_dit  ckjm_ca ckjm_npm ckjm_cbo
## churn            1.00000  0.04068 -0.18123 -0.09533 -0.03863  0.13993
## raw_loc          0.04068  1.00000  0.02788  0.40114  0.55332  0.72140
## ckjm_dit        -0.18123  0.02788  1.00000  0.40002  0.13842 -0.03303
## ckjm_ca         -0.09533  0.40114  0.40002  1.00000  0.56138  0.25725
## ckjm_npm        -0.03863  0.55332  0.13842  0.56138  1.00000  0.52626
## ckjm_cbo         0.13993  0.72140 -0.03303  0.25725  0.52626  1.00000
## ckjm_noc         0.13628  0.37671 -0.09934  0.34339  0.44897  0.29756
## ckjm_rfc         0.10356  0.90899  0.03722  0.43370  0.63417  0.85060
## ckjm_lcom        0.01016  0.77201  0.14412  0.56493  0.75494  0.62148
## ckjm_wmc         0.04223  0.85202  0.11542  0.57488  0.76257  0.70321
## discussion_mean  0.67969 -0.16895  0.02266 -0.14024 -0.18610  0.05077
## actions_mean     0.65640 -0.13912 -0.19284 -0.15449 -0.18120  0.03868
##                  ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn            0.136282  0.10356   0.01016  0.04223        0.679689
## raw_loc          0.376709  0.90899   0.77201  0.85202       -0.168949
## ckjm_dit        -0.099336  0.03722   0.14412  0.11542        0.022664
## ckjm_ca          0.343392  0.43370   0.56493  0.57488       -0.140239
## ckjm_npm         0.448975  0.63417   0.75494  0.76257       -0.186105
## ckjm_cbo         0.297563  0.85060   0.62148  0.70321        0.050771
## ckjm_noc         1.000000  0.40666   0.47768  0.48016       -0.000484
## ckjm_rfc         0.406661  1.00000   0.83373  0.92369       -0.101221
## ckjm_lcom        0.477679  0.83373   1.00000  0.94519       -0.182146
## ckjm_wmc         0.480157  0.92369   0.94519  1.00000       -0.188429
## discussion_mean -0.000484 -0.10122  -0.18215 -0.18843        1.000000
## actions_mean    -0.063289 -0.10899  -0.14501 -0.16042        0.485178
##                 actions_mean
## churn                0.65640
## raw_loc             -0.13912
## ckjm_dit            -0.19284
## ckjm_ca             -0.15449
## ckjm_npm            -0.18120
## ckjm_cbo             0.03868
## ckjm_noc            -0.06329
## ckjm_rfc            -0.10899
## ckjm_lcom           -0.14501
## ckjm_wmc            -0.16042
## discussion_mean      0.48518
## actions_mean         1.00000

cor(lucene.all.list[[1]], method = "spearman")
##                    churn  raw_loc ckjm_dit  ckjm_ca ckjm_npm ckjm_cbo
## churn            1.00000 -0.01779 -0.02746 -0.07796 -0.12384  0.01204
## raw_loc         -0.01779  1.00000  0.65722  0.63864  0.69413  0.85936
## ckjm_dit        -0.02746  0.65722  1.00000  0.79861  0.44885  0.61369
## ckjm_ca         -0.07796  0.63864  0.79861  1.00000  0.43759  0.57931
## ckjm_npm        -0.12384  0.69413  0.44885  0.43759  1.00000  0.55044
## ckjm_cbo         0.01204  0.85936  0.61369  0.57931  0.55044  1.00000
## ckjm_noc         0.02890  0.15666  0.47376  0.56650  0.06137  0.31684
## ckjm_rfc        -0.02734  0.96308  0.68829  0.62440  0.77516  0.89379
## ckjm_lcom       -0.15162  0.80937  0.67958  0.74415  0.62512  0.66545
## ckjm_wmc        -0.04474  0.94364  0.71369  0.70397  0.77584  0.81565
## discussion_mean  0.66407  0.02804 -0.04835 -0.09014 -0.01720  0.16206
## actions_mean     0.33773 -0.23778 -0.21412 -0.09185 -0.20546 -0.13125
##                 ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn            0.02890 -0.02734   -0.1516 -0.04474         0.66407
## raw_loc          0.15666  0.96308    0.8094  0.94364         0.02804
## ckjm_dit         0.47376  0.68829    0.6796  0.71369        -0.04835
## ckjm_ca          0.56650  0.62440    0.7442  0.70397        -0.09014
## ckjm_npm         0.06137  0.77516    0.6251  0.77584        -0.01720
## ckjm_cbo         0.31684  0.89379    0.6655  0.81565         0.16206
## ckjm_noc         1.00000  0.16049    0.1983  0.22700         0.01291
## ckjm_rfc         0.16049  1.00000    0.7778  0.94026         0.06491
## ckjm_lcom        0.19827  0.77780    1.0000  0.84625        -0.12155
## ckjm_wmc         0.22700  0.94026    0.8462  1.00000         0.05218
## discussion_mean  0.01291  0.06491   -0.1216  0.05218         1.00000
## actions_mean    -0.04588 -0.17863   -0.2566 -0.23744         0.10253
##                 actions_mean
## churn                0.33773
## raw_loc             -0.23778
## ckjm_dit            -0.21412
## ckjm_ca             -0.09185
## ckjm_npm            -0.20546
## ckjm_cbo            -0.13125
## ckjm_noc            -0.04588
## ckjm_rfc            -0.17863
## ckjm_lcom           -0.25664
## ckjm_wmc            -0.23744
## discussion_mean      0.10253
## actions_mean         1.00000

cor(pdfbox.all.list[[1]], method = "spearman")
##                     churn raw_loc ckjm_dit   ckjm_ca ckjm_npm ckjm_cbo
## churn            1.000000 0.08266 -0.11297 -0.002506  0.03566  0.22045
## raw_loc          0.082659 1.00000  0.25233  0.503500  0.62121  0.65912
## ckjm_dit        -0.112967 0.25233  1.00000  0.208163  0.32667  0.07423
## ckjm_ca         -0.002506 0.50350  0.20816  1.000000  0.67078  0.18252
## ckjm_npm         0.035656 0.62121  0.32667  0.670776  1.00000  0.43445
## ckjm_cbo         0.220451 0.65912  0.07423  0.182521  0.43445  1.00000
## ckjm_noc         0.123168 0.29383  0.21313  0.412596  0.56535  0.12631
## ckjm_rfc         0.271648 0.82477  0.27079  0.349083  0.58778  0.83549
## ckjm_lcom        0.276944 0.63571 -0.13052  0.494029  0.47955  0.54711
## ckjm_wmc         0.078687 0.74983  0.35392  0.679891  0.96324  0.48678
## discussion_mean  0.523478 0.06732  0.04956 -0.095196  0.13650  0.34866
## actions_mean    -0.147172 0.23926  0.23382  0.292148  0.08876  0.02213
##                 ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn            0.12317  0.27165    0.2769  0.07869         0.52348
## raw_loc          0.29383  0.82477    0.6357  0.74983         0.06732
## ckjm_dit         0.21313  0.27079   -0.1305  0.35392         0.04956
## ckjm_ca          0.41260  0.34908    0.4940  0.67989        -0.09520
## ckjm_npm         0.56535  0.58778    0.4795  0.96324         0.13650
## ckjm_cbo         0.12631  0.83549    0.5471  0.48678         0.34866
## ckjm_noc         1.00000  0.40773    0.3423  0.53112         0.09682
## ckjm_rfc         0.40773  1.00000    0.5884  0.67785         0.23822
## ckjm_lcom        0.34225  0.58839    1.0000  0.55711         0.11542
## ckjm_wmc         0.53112  0.67785    0.5571  1.00000         0.13507
## discussion_mean  0.09682  0.23822    0.1154  0.13507         1.00000
## actions_mean     0.02973  0.06261    0.1848  0.18422        -0.07067
##                 actions_mean
## churn               -0.14717
## raw_loc              0.23926
## ckjm_dit             0.23382
## ckjm_ca              0.29215
## ckjm_npm             0.08876
## ckjm_cbo             0.02213
## ckjm_noc             0.02973
## ckjm_rfc             0.06261
## ckjm_lcom            0.18484
## ckjm_wmc             0.18422
## discussion_mean     -0.07067
## actions_mean         1.00000
cor(pdfbox.all.list[[2]], method = "spearman")
##                    churn raw_loc  ckjm_dit   ckjm_ca ckjm_npm ckjm_cbo
## churn           1.000000 0.03451  0.011139 0.2088798   0.2469 0.005611
## raw_loc         0.034505 1.00000  0.460094 0.5333520   0.6101 0.689052
## ckjm_dit        0.011139 0.46009  1.000000 0.2077431   0.3912 0.423190
## ckjm_ca         0.208880 0.53335  0.207743 1.0000000   0.7324 0.373447
## ckjm_npm        0.246944 0.61009  0.391197 0.7324485   1.0000 0.670955
## ckjm_cbo        0.005611 0.68905  0.423190 0.3734470   0.6710 1.000000
## ckjm_noc        0.077062 0.12895  0.163043 0.2572310   0.2214 0.277218
## ckjm_rfc        0.096361 0.82956  0.581294 0.5186755   0.7731 0.879006
## ckjm_lcom       0.097579 0.72171  0.288556 0.5551578   0.6696 0.556320
## ckjm_wmc        0.199424 0.74139  0.576381 0.6879733   0.9551 0.764129
## discussion_mean 0.498902 0.17878 -0.007618 0.0002222   0.2147 0.156894
## actions_mean    0.204191 0.10636 -0.092386 0.0123882   0.1247 0.156123
##                 ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn            0.07706  0.09636   0.09758   0.1994       0.4989020
## raw_loc          0.12895  0.82956   0.72171   0.7414       0.1787779
## ckjm_dit         0.16304  0.58129   0.28856   0.5764      -0.0076179
## ckjm_ca          0.25723  0.51868   0.55516   0.6880       0.0002222
## ckjm_npm         0.22141  0.77306   0.66956   0.9551       0.2147206
## ckjm_cbo         0.27722  0.87901   0.55632   0.7641       0.1568943
## ckjm_noc         1.00000  0.28467   0.23855   0.2459      -0.0683406
## ckjm_rfc         0.28467  1.00000   0.73548   0.8810       0.2291149
## ckjm_lcom        0.23855  0.73548   1.00000   0.7219       0.1956951
## ckjm_wmc         0.24593  0.88101   0.72194   1.0000       0.2235857
## discussion_mean -0.06834  0.22911   0.19570   0.2236       1.0000000
## actions_mean     0.02417  0.15784   0.07452   0.1246       0.4573630
##                 actions_mean
## churn                0.20419
## raw_loc              0.10636
## ckjm_dit            -0.09239
## ckjm_ca              0.01239
## ckjm_npm             0.12474
## ckjm_cbo             0.15612
## ckjm_noc             0.02417
## ckjm_rfc             0.15784
## ckjm_lcom            0.07452
## ckjm_wmc             0.12456
## discussion_mean      0.45736
## actions_mean         1.00000
cor(pdfbox.all.list[[3]], method = "spearman")
##                    churn raw_loc ckjm_dit   ckjm_ca ckjm_npm ckjm_cbo
## churn            1.00000  0.3075  0.13900  0.061350  0.11308  0.13370
## raw_loc          0.30748  1.0000  0.32322  0.365720  0.40616  0.63237
## ckjm_dit         0.13900  0.3232  1.00000  0.055127  0.10399  0.20465
## ckjm_ca          0.06135  0.3657  0.05513  1.000000  0.52272  0.09458
## ckjm_npm         0.11308  0.4062  0.10399  0.522722  1.00000  0.28181
## ckjm_cbo         0.13370  0.6324  0.20465  0.094576  0.28181  1.00000
## ckjm_noc         0.25668  0.1540  0.04316  0.433677  0.45411  0.15967
## ckjm_rfc         0.15534  0.7798  0.33889  0.216167  0.45783  0.88306
## ckjm_lcom        0.19973  0.5558  0.16441  0.217444  0.30211  0.42848
## ckjm_wmc         0.17238  0.6751  0.30955  0.465545  0.88590  0.48141
## discussion_mean -0.07297  0.1888  0.28107  0.043735 -0.05105 -0.01361
## actions_mean     0.01439  0.2057  0.21095 -0.009989 -0.08642  0.09200
##                 ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn            0.25668   0.1553   0.19973  0.17238        -0.07297
## raw_loc          0.15400   0.7798   0.55582  0.67511         0.18884
## ckjm_dit         0.04316   0.3389   0.16441  0.30955         0.28107
## ckjm_ca          0.43368   0.2162   0.21744  0.46554         0.04373
## ckjm_npm         0.45411   0.4578   0.30211  0.88590        -0.05105
## ckjm_cbo         0.15967   0.8831   0.42848  0.48141        -0.01361
## ckjm_noc         1.00000   0.2361   0.29716  0.36315        -0.19878
## ckjm_rfc         0.23609   1.0000   0.54911  0.70724         0.10807
## ckjm_lcom        0.29716   0.5491   1.00000  0.48857         0.08021
## ckjm_wmc         0.36315   0.7072   0.48857  1.00000         0.06814
## discussion_mean -0.19878   0.1081   0.08021  0.06814         1.00000
## actions_mean    -0.16068   0.1306   0.13876  0.01883         0.60793
##                 actions_mean
## churn               0.014387
## raw_loc             0.205691
## ckjm_dit            0.210948
## ckjm_ca            -0.009989
## ckjm_npm           -0.086417
## ckjm_cbo            0.092001
## ckjm_noc           -0.160680
## ckjm_rfc            0.130572
## ckjm_lcom           0.138761
## ckjm_wmc            0.018830
## discussion_mean     0.607926
## actions_mean        1.000000
cor(pdfbox.all.list[[4]], method = "spearman")
##                    churn raw_loc  ckjm_dit  ckjm_ca ckjm_npm  ckjm_cbo
## churn            1.00000  0.1215 -0.357074  0.15689   0.2996  0.213176
## raw_loc          0.12150  1.0000  0.314058  0.69346   0.5806  0.509925
## ckjm_dit        -0.35707  0.3141  1.000000  0.05843   0.0846  0.006989
## ckjm_ca          0.15689  0.6935  0.058430  1.00000   0.8085  0.330029
## ckjm_npm         0.29958  0.5806  0.084599  0.80854   1.0000  0.449054
## ckjm_cbo         0.21318  0.5099  0.006989  0.33003   0.4491  1.000000
## ckjm_noc        -0.04650  0.2515  0.080080  0.32735   0.3621  0.246030
## ckjm_rfc         0.15481  0.7931  0.201890  0.52023   0.5878  0.773068
## ckjm_lcom        0.03189  0.7808  0.166679  0.78342   0.7349  0.435924
## ckjm_wmc         0.11058  0.7596  0.291369  0.78276   0.8809  0.454229
## discussion_mean  0.77301  0.2366 -0.452609  0.31814   0.4328  0.131094
## actions_mean    -0.51955 -0.2441  0.349664 -0.32921  -0.4422 -0.379397
##                 ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn           -0.04650   0.1548   0.03189   0.1106          0.7730
## raw_loc          0.25152   0.7931   0.78077   0.7596          0.2366
## ckjm_dit         0.08008   0.2019   0.16668   0.2914         -0.4526
## ckjm_ca          0.32735   0.5202   0.78342   0.7828          0.3181
## ckjm_npm         0.36212   0.5878   0.73493   0.8809          0.4328
## ckjm_cbo         0.24603   0.7731   0.43592   0.4542          0.1311
## ckjm_noc         1.00000   0.2587   0.39938   0.3642         -0.1139
## ckjm_rfc         0.25870   1.0000   0.73494   0.7220          0.2110
## ckjm_lcom        0.39938   0.7349   1.00000   0.9122          0.1690
## ckjm_wmc         0.36415   0.7220   0.91225   1.0000          0.2361
## discussion_mean -0.11393   0.2110   0.16900   0.2361          1.0000
## actions_mean    -0.20291  -0.2926  -0.38453  -0.3656         -0.3560
##                 actions_mean
## churn                -0.5195
## raw_loc              -0.2441
## ckjm_dit              0.3497
## ckjm_ca              -0.3292
## ckjm_npm             -0.4422
## ckjm_cbo             -0.3794
## ckjm_noc             -0.2029
## ckjm_rfc             -0.2926
## ckjm_lcom            -0.3845
## ckjm_wmc             -0.3656
## discussion_mean      -0.3560
## actions_mean          1.0000

cor(ivy.all.list[[1]], method = "spearman")
##                     churn  raw_loc   ckjm_dit  ckjm_ca ckjm_npm ckjm_cbo
## churn            1.000000 -0.03073 -0.0504024 -0.10119  0.01100 -0.04262
## raw_loc         -0.030734  1.00000  0.4372613  0.36212  0.61813  0.74911
## ckjm_dit        -0.050402  0.43726  1.0000000  0.59081  0.28687  0.48172
## ckjm_ca         -0.101187  0.36212  0.5908138  1.00000  0.46860  0.43568
## ckjm_npm         0.010997  0.61813  0.2868664  0.46860  1.00000  0.48880
## ckjm_cbo        -0.042620  0.74911  0.4817189  0.43568  0.48880  1.00000
## ckjm_noc        -0.017535 -0.06029 -0.1151528  0.04982  0.13419  0.02978
## ckjm_rfc        -0.028410  0.95017  0.4892025  0.41187  0.63135  0.85834
## ckjm_lcom       -0.006379  0.61776  0.2187122  0.46319  0.87908  0.51944
## ckjm_wmc        -0.014363  0.74784  0.3734588  0.49677  0.95031  0.64037
## discussion_mean  0.582550 -0.12404 -0.1236095 -0.07788 -0.02660 -0.12553
## actions_mean     0.194687 -0.09717 -0.0005875  0.03507 -0.06197 -0.03752
##                 ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn           -0.01753 -0.02841 -0.006379 -0.01436         0.58255
## raw_loc         -0.06029  0.95017  0.617761  0.74784        -0.12404
## ckjm_dit        -0.11515  0.48920  0.218712  0.37346        -0.12361
## ckjm_ca          0.04982  0.41187  0.463193  0.49677        -0.07788
## ckjm_npm         0.13419  0.63135  0.879078  0.95031        -0.02660
## ckjm_cbo         0.02978  0.85834  0.519436  0.64037        -0.12553
## ckjm_noc         1.00000 -0.01115  0.170395  0.13733         0.06504
## ckjm_rfc        -0.01115  1.00000  0.620444  0.77330        -0.12276
## ckjm_lcom        0.17040  0.62044  1.000000  0.89317        -0.06813
## ckjm_wmc         0.13733  0.77330  0.893167  1.00000        -0.06515
## discussion_mean  0.06504 -0.12276 -0.068133 -0.06515         1.00000
## actions_mean    -0.06446 -0.05240 -0.106281 -0.07153         0.19486
##                 actions_mean
## churn              0.1946873
## raw_loc           -0.0971703
## ckjm_dit          -0.0005875
## ckjm_ca            0.0350730
## ckjm_npm          -0.0619714
## ckjm_cbo          -0.0375173
## ckjm_noc          -0.0644574
## ckjm_rfc          -0.0524048
## ckjm_lcom         -0.1062805
## ckjm_wmc          -0.0715348
## discussion_mean    0.1948607
## actions_mean       1.0000000
cor(ivy.all.list[[2]], method = "spearman")
##                    churn  raw_loc ckjm_dit  ckjm_ca  ckjm_npm ckjm_cbo
## churn            1.00000 -0.03704  0.24069 -0.03393  0.070096  0.07017
## raw_loc         -0.03704  1.00000  0.38094  0.25006  0.632097  0.83681
## ckjm_dit         0.24069  0.38094  1.00000  0.46648  0.425585  0.38095
## ckjm_ca         -0.03393  0.25006  0.46648  1.00000  0.501459  0.21847
## ckjm_npm         0.07010  0.63210  0.42558  0.50146  1.000000  0.48137
## ckjm_cbo         0.07017  0.83681  0.38095  0.21847  0.481374  1.00000
## ckjm_noc         0.18937 -0.05288  0.09644 -0.02595  0.005631 -0.21175
## ckjm_rfc         0.05595  0.93951  0.38279  0.16620  0.572420  0.91050
## ckjm_lcom        0.08840  0.66431  0.19367  0.50458  0.826130  0.55818
## ckjm_wmc         0.06277  0.77963  0.46949  0.49039  0.916560  0.65624
## discussion_mean  0.66106 -0.25453 -0.03955 -0.26531 -0.241694 -0.23307
## actions_mean     0.30466 -0.14218 -0.09492 -0.14315  0.084735 -0.21359
##                  ckjm_noc ckjm_rfc ckjm_lcom  ckjm_wmc discussion_mean
## churn            0.189370  0.05595   0.08840  0.062767         0.66106
## raw_loc         -0.052881  0.93951   0.66431  0.779626        -0.25453
## ckjm_dit         0.096442  0.38279   0.19367  0.469487        -0.03955
## ckjm_ca         -0.025952  0.16620   0.50458  0.490393        -0.26531
## ckjm_npm         0.005631  0.57242   0.82613  0.916560        -0.24169
## ckjm_cbo        -0.211746  0.91050   0.55818  0.656244        -0.23307
## ckjm_noc         1.000000 -0.13338  -0.07160  0.013338         0.14938
## ckjm_rfc        -0.133375  1.00000   0.62128  0.741185        -0.19643
## ckjm_lcom       -0.071597  0.62128   1.00000  0.858996        -0.18734
## ckjm_wmc         0.013338  0.74118   0.85900  1.000000        -0.23708
## discussion_mean  0.149377 -0.19643  -0.18734 -0.237078         1.00000
## actions_mean     0.168377 -0.14903   0.04066 -0.006774         0.25335
##                 actions_mean
## churn               0.304665
## raw_loc            -0.142185
## ckjm_dit           -0.094923
## ckjm_ca            -0.143152
## ckjm_npm            0.084735
## ckjm_cbo           -0.213590
## ckjm_noc            0.168377
## ckjm_rfc           -0.149028
## ckjm_lcom           0.040664
## ckjm_wmc           -0.006774
## discussion_mean     0.253353
## actions_mean        1.000000
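
The matrices above are dense, so it can help to list only the strongly correlated pairs. The following is a minimal sketch on toy data, not part of the original analysis: the helper name `strongPairs` and the 0.8 threshold are assumptions chosen for illustration.

```r
# Hypothetical helper: list the variable pairs whose absolute Spearman
# correlation is at or above a threshold.
strongPairs <- function(data, threshold = 0.8) {
    rho <- cor(data, method = "spearman")
    # Look only at the upper triangle so each pair is reported once.
    idx <- which(upper.tri(rho) & abs(rho) >= threshold, arr.ind = TRUE)
    data.frame(var1 = rownames(rho)[idx[, "row"]],
               var2 = colnames(rho)[idx[, "col"]],
               rho = rho[idx],
               stringsAsFactors = FALSE)
}

# Toy data: y is a monotone function of x, z is only weakly related to x.
toy <- data.frame(x = 1:10, y = (1:10)^2, z = c(3, 9, 1, 7, 5, 10, 2, 8, 4, 6))
toy.pairs <- strongPairs(toy)
toy.pairs
```

Applied to one of the per-release dataframes, for example `strongPairs(derby.all.list[[3]])`, a helper of this kind would return the same information as the printed matrix, restricted to the pairs above the chosen threshold.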

Save every release as a CSV file in the folder so that it can be used in Weka or in other tools.
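
As a rough sketch of such an export, the loop below writes one CSV per release into a temporary directory. The toy list, the `toy_cross_sectional_` file prefix and the use of `tempfile()` are assumptions for illustration. The real analysis would iterate over a list such as `pdfbox.all.list` and write to the project folder.

```r
# Toy stand-in for a per-release list of dataframes.
toy.list <- list(data.frame(release = "1.0.0", discussion = c(3, 5, 8)),
                 data.frame(release = "1.1.0", discussion = c(1, 4)))

# Write into a fresh temporary directory so nothing is overwritten.
out.dir <- tempfile("cross_sectional_")
dir.create(out.dir)

for (d in toy.list) {
    # File name encodes the number of data points and the release label.
    file.name <- paste0("toy_cross_sectional_n", nrow(d), "_", d$release[1], ".csv")
    write.csv(d, file.path(out.dir, file.name), row.names = FALSE)
}

written <- sort(list.files(out.dir))
written
```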

library(Hmisc)
## Loading required package: survival
## Loading required package: splines
## Hmisc library by Frank E Harrell Jr
## 
## Type library(help='Hmisc'), ?Overview, or ?Hmisc.Overview') to see overall
## documentation.
## 
## NOTE:Hmisc no longer redefines [.factor to drop unused levels when
## subsetting.  To get the old behavior of Hmisc type dropUnusedLevels().
## Attaching package: 'Hmisc'
## The following object(s) are masked from 'package:survival':
## 
## untangle.specials
## The following object(s) are masked from 'package:plyr':
## 
## is.discrete, summarize
## The following object(s) are masked from 'package:base':
## 
## format.pval, round.POSIXt, trunc.POSIXt, units
# setwd('~/Dropbox/Academia/Hawaii/Carlos_Thesis_Papers/Thesis/Chapters/scripts/weka_data/discussion_variable_is_interval/pdfbox')
# Generate a csv whose name contains the number of data points and the
# release, so a cross-sectional classification can be performed.
# for (i in 1:length(pdfbox.all.list)) {
#     # Create a list of dataframes where each dataframe contains the
#     # datapoints of a given release.
#     # Dichotomize discussion into intervals of equal frequency
#     pdfbox.all.list[[i]]$discussion = cut2(pdfbox.all.list[[i]]$discussion)
#     write.csv(pdfbox.all.list[[i]],
#         paste0('pdfbox_cross_sectional_n', nrow(pdfbox.all.list[[i]]), '_', pdfbox.all.list[[i]]$release[1], '.csv'))
# }
plot(derby.all.list[[1]])

plot of chunk unnamed-chunk-10

plot(derby.all.list[[2]])

plot of chunk unnamed-chunk-10

plot(derby.all.list[[3]])

plot of chunk unnamed-chunk-10

plot(derby.all.list[[4]])

plot of chunk unnamed-chunk-10

plot(derby.all.list[[5]])

plot of chunk unnamed-chunk-10

plot(derby.all.list[[6]])

plot of chunk unnamed-chunk-10


plot(lucene.all.list[[1]])

plot of chunk unnamed-chunk-10


plot(pdfbox.all.list[[1]])

plot of chunk unnamed-chunk-10

plot(pdfbox.all.list[[2]])

plot of chunk unnamed-chunk-10

plot(pdfbox.all.list[[3]])

plot of chunk unnamed-chunk-10

plot(pdfbox.all.list[[4]])

plot of chunk unnamed-chunk-10


plot(ivy.all.list[[1]])

plot of chunk unnamed-chunk-10

plot(ivy.all.list[[2]])

plot of chunk unnamed-chunk-10

Discussion Effort Estimator Analysis

The first thing we must do is obtain the training and test sets from the filtered releases. derby.discussion, lucene.discussion, pdfbox.discussion and ivy.discussion each contain the data of all releases. Let's break them down per release:

# Create a list of dataframes (tables) where each dataframe contains
# datapoints of a given release for each project.
derby.discussion.list = split(derby.discussion, factor(derby$release))
lucene.discussion.list = split(lucene.discussion, factor(lucene$release))
pdfbox.discussion.list = split(pdfbox.discussion, factor(pdfbox$release))
ivy.discussion.list = split(ivy.discussion, factor(ivy$release))

Since the order of the dataframes is the order in which the releases occurred, the first position of each project list contains the first release of that project, the second position the second release, and so on. This leaves us with 6 releases of derby, 1 of lucene, 4 of pdfbox and 2 of ivy. Notice that this variation is also influenced by the mapping between files and issues, along with several other threats to validity reported in the paper.
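
One caveat on ordering: `split()` returns its elements in the order of the factor levels, and `factor()` sorts the distinct labels as strings by default. That matches chronological order only when the release labels happen to sort that way. The sketch below uses hypothetical release labels on toy data and makes the order explicit by sorting the labels as version numbers.

```r
# Toy data: hypothetical release labels, deliberately out of order.
toy <- data.frame(release = c("10.2.1.6", "10.10.1.1", "10.1.1.0", "10.2.1.6"),
                  discussion = c(3, 7, 15, 0),
                  stringsAsFactors = FALSE)

# Order the distinct labels as version numbers rather than as plain strings.
labels <- unique(toy$release)
ordered.labels <- labels[order(numeric_version(labels))]

# Passing explicit levels fixes the order of the resulting list.
toy.by.release <- split(toy, factor(toy$release, levels = ordered.labels))
names(toy.by.release)
```

With explicit levels, position 1 of the list is the earliest version regardless of how the labels would sort as text.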

Exploratory Data Analysis for Discussion Effort Estimator

Let's start by observing some characteristics of our datasets. One thing we are interested in for discussion is whether there is any inflation of zeros (many issues with zero discussions) and, if so, how it changes over time.

suppressMessages(library(ggplot2))
qplot(discussion, data = derby, facets = release ~ ., geom = "histogram", binwidth = 1, 
    main = "Derby")

plot of chunk unnamed-chunk-12

We can observe that, of the 6 releases, only the fourth (10.5.3.0) and fifth (10.6.1.0) show an inflation of zeros in discussion. This might affect how well a model trained on 10.1.1.0 behaves when predicting the fourth and fifth releases. We can also observe that most discussion values fall in the range between 0 and 20.
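
The visual impression of zero inflation can be backed by a number. As a minimal sketch on toy data (the release labels `r1` and `r2` and the counts are made up), the observed share of zeros per release can be compared with the share a Poisson distribution with the same mean would predict, P(Y = 0) = exp(-lambda).

```r
# Toy data; the real analysis would use the release and discussion columns
# of a project dataframe such as derby.
toy <- data.frame(release = rep(c("r1", "r2"), each = 6),
                  discussion = c(4, 7, 2, 9, 5, 3, 0, 0, 0, 6, 12, 0))

by.release <- split(toy$discussion, toy$release)

# Observed share of zero counts per release.
observed.zero <- sapply(by.release, function(d) mean(d == 0))

# Share of zeros predicted by a Poisson distribution with the same mean.
expected.zero <- sapply(by.release, function(d) exp(-mean(d)))

zero.summary <- data.frame(observed = observed.zero, expected = expected.zero)
zero.summary
```

An observed share far above the expected one, as in the toy release `r2`, is the pattern described above as an inflation of zeros.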

The following plots display the distribution for the remaining projects (lucene, pdfbox, and ivy).

qplot(discussion, data = lucene, facets = release ~ ., geom = "histogram", binwidth = 1, 
    main = "Lucene")

plot of chunk unnamed-chunk-13

qplot(discussion, data = pdfbox, facets = release ~ ., geom = "histogram", binwidth = 1, 
    main = "Pdfbox")

plot of chunk unnamed-chunk-13

qplot(discussion, data = ivy, facets = release ~ ., geom = "histogram", binwidth = 1, 
    main = "Ivy")

plot of chunk unnamed-chunk-13

We can see that an inflation of zeros is not common in the discussion distributions (again, beware that the missing zero-discussion issues may be related to the issue-mapping threat to validity).

For the sake of this analysis, I will use the first release of derby as the training set of the model. The remaining releases will be considered test datasets for testing the hypothesis.

derby.discussion.train = derby.discussion.list[[1]]
head(derby.discussion.train)
##    discussion raw_loc ckjm_dit ckjm_ca ckjm_npm ckjm_cbo ckjm_noc ckjm_rfc
## 7          15      23        2       1        3        0        0        6
## 41         15    1373        0      11       71       19        1      267
## 52         15    2998        1      13      178       20        1      429
## 53         15    2998        1      13      178       20        1      429
## 54          7    2998        1      13      178       20        1      429
## 70         12    1563        1      16       76       20        1      251
##    ckjm_lcom ckjm_wmc
## 7          0        3
## 41       450      109
## 52      3560      268
## 53      3560      268
## 54      3560      268
## 70      2946      129

Lastly, we can also observe how the distribution of each structural complexity file metric varies over time. Since I am only concerned with the first hypothesis test at this point, that is, the intra-project analysis, I defer further analysis comparing projects until the appropriate hypothesis test.

a <- qplot(release, discussion, data = derby, geom = "boxplot")
b <- qplot(release, raw_loc, data = derby, geom = "boxplot")
c <- qplot(release, ckjm_dit, data = derby, geom = "boxplot")
d <- qplot(release, ckjm_ca, data = derby, geom = "boxplot")
e <- qplot(release, ckjm_npm, data = derby, geom = "boxplot")
f <- qplot(release, ckjm_cbo, data = derby, geom = "boxplot")
g <- qplot(release, ckjm_noc, data = derby, geom = "boxplot")
h <- qplot(release, ckjm_rfc, data = derby, geom = "boxplot")
i <- qplot(release, ckjm_lcom, data = derby, geom = "boxplot")
j <- qplot(release, ckjm_wmc, data = derby, geom = "boxplot")

library(grid)
grid.newpage()
pushViewport(viewport(layout = grid.layout(10, 1)))
print(a, vp = viewport(layout.pos.row = 1, layout.pos.col = 1), main = "Derby")
print(b, vp = viewport(layout.pos.row = 2, layout.pos.col = 1))
print(c, vp = viewport(layout.pos.row = 3, layout.pos.col = 1))
print(d, vp = viewport(layout.pos.row = 4, layout.pos.col = 1))
print(e, vp = viewport(layout.pos.row = 5, layout.pos.col = 1))
print(f, vp = viewport(layout.pos.row = 6, layout.pos.col = 1))
print(g, vp = viewport(layout.pos.row = 7, layout.pos.col = 1))
print(h, vp = viewport(layout.pos.row = 8, layout.pos.col = 1))
print(i, vp = viewport(layout.pos.row = 9, layout.pos.col = 1))
print(j, vp = viewport(layout.pos.row = 10, layout.pos.col = 1))

plot of chunk unnamed-chunk-15

We can see that there is a great dispersion of discussion across the releases, while the same does not occur with the structural complexity file metrics.

Statistical Model

The analysis in this section is based almost entirely on this journal.

We first need to identify which of our predictors are correlated with each other. Strongly correlated predictors should not be used together in the same model.

cor(derby.discussion.train, method = "spearman")
##            discussion  raw_loc  ckjm_dit ckjm_ca ckjm_npm ckjm_cbo
## discussion    1.00000 0.251734  0.138554 0.01328   0.1636 -0.03983
## raw_loc       0.25173 1.000000  0.008259 0.35417   0.5702  0.73710
## ckjm_dit      0.13855 0.008259  1.000000 0.01973  -0.1148 -0.35746
## ckjm_ca       0.01328 0.354173  0.019730 1.00000   0.7506  0.44710
## ckjm_npm      0.16360 0.570169 -0.114814 0.75065   1.0000  0.62068
## ckjm_cbo     -0.03983 0.737098 -0.357457 0.44710   0.6207  1.00000
## ckjm_noc      0.01127 0.163896 -0.177063 0.37648   0.3714  0.20376
## ckjm_rfc      0.22045 0.962672 -0.020208 0.44928   0.6698  0.77360
## ckjm_lcom     0.16181 0.738295  0.066292 0.54449   0.6375  0.56876
## ckjm_wmc      0.22742 0.850710  0.022250 0.66296   0.8238  0.67275
##            ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc
## discussion  0.01127  0.22045   0.16181  0.22742
## raw_loc     0.16390  0.96267   0.73829  0.85071
## ckjm_dit   -0.17706 -0.02021   0.06629  0.02225
## ckjm_ca     0.37648  0.44928   0.54449  0.66296
## ckjm_npm    0.37139  0.66981   0.63755  0.82385
## ckjm_cbo    0.20376  0.77360   0.56876  0.67275
## ckjm_noc    1.00000  0.20271   0.20823  0.30841
## ckjm_rfc    0.20271  1.00000   0.73811  0.89357
## ckjm_lcom   0.20823  0.73811   1.00000  0.82654
## ckjm_wmc    0.30841  0.89357   0.82654  1.00000

From this matrix, by enumerating the possible combinations of predictors and discarding those that contain strongly correlated pairs, we are left with the following predictor combinations:

Predictors

Model 1: ['dit', 'ca', 'rawloc', 'noc']
Model 2: ['dit', 'npm', 'rawloc', 'noc']
Model 3: ['dit', 'rfc', 'ca', 'noc']
Model 4: ['dit', 'npm', 'rfc', 'noc']
Model 5: ['dit', 'ca', 'lcom', 'cbo', 'noc']
Model 6: ['dit', 'wmc', 'ca', 'cbo', 'noc']
Model 7: ['dit', 'npm', 'lcom', 'cbo', 'noc']
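The filtering step can be sketched in a few lines. The example below lists the predictor pairs whose absolute Spearman correlation exceeds a cutoff. The cutoff of 0.7 is an assumption on my part (no threshold is stated above), and the data frame toy with columns x1, x2 and x3 is simulated purely for illustration.

```r
set.seed(1)
n <- 200
x1 <- rnorm(n)

# Toy predictors: x2 is nearly a copy of x1, x3 is independent of both
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.1),
  x3 = rnorm(n)
)

rho <- cor(toy, method = "spearman")

# Assumed cutoff; the analysis above does not state which one was used
cutoff <- 0.7

# Look only at the upper triangle so that each pair is reported once
high <- which(abs(rho) > cutoff & upper.tri(rho), arr.ind = TRUE)

correlated_pairs <- data.frame(
  a = rownames(rho)[high[, "row"]],
  b = colnames(rho)[high[, "col"]],
  rho = rho[high]
)
print(correlated_pairs)
```

Any candidate model that contains both members of a listed pair would then be dropped.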

We now fit the Poisson models.

model1 <- glm(discussion ~ ckjm_dit + ckjm_ca + raw_loc + ckjm_noc, data = derby.discussion.train, 
    family = poisson)
model2 <- glm(discussion ~ ckjm_dit + ckjm_npm + raw_loc + ckjm_noc, data = derby.discussion.train, 
    family = poisson)
model3 <- glm(discussion ~ ckjm_dit + ckjm_rfc + ckjm_ca + ckjm_noc, data = derby.discussion.train, 
    family = poisson)
model4 <- glm(discussion ~ ckjm_dit + ckjm_npm + ckjm_rfc + ckjm_noc, data = derby.discussion.train, 
    family = poisson)
model5 <- glm(discussion ~ ckjm_dit + ckjm_ca + ckjm_lcom + ckjm_cbo + ckjm_noc, 
    data = derby.discussion.train, family = poisson)
model6 <- glm(discussion ~ ckjm_dit + ckjm_wmc + ckjm_ca + ckjm_cbo + ckjm_noc, 
    data = derby.discussion.train, family = poisson)
model7 <- glm(discussion ~ ckjm_dit + ckjm_npm + ckjm_lcom + ckjm_cbo + ckjm_noc, 
    data = derby.discussion.train, family = poisson)

# model1 model2 model3 model4 model5 model6

Zero inflated models

library(pscl)
## Loading required package: MASS
## Loading required package: mvtnorm
## Loading required package: coda
## Loading required package: lattice
## Loading required package: gam
## Loaded gam 1.06.2
## Loading required package: vcd
## Loading required package: colorspace
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
znmodel1 <- glm(discussion ~ ckjm_dit + ckjm_ca + raw_loc + ckjm_noc, data = derby.discussion.list[[4]], 
    family = poisson)
znmodel2 <- glm(discussion ~ ckjm_dit + ckjm_npm + raw_loc + ckjm_noc, data = derby.discussion.list[[4]], 
    family = poisson)
znmodel3 <- glm(discussion ~ ckjm_dit + ckjm_rfc + ckjm_ca + ckjm_noc, data = derby.discussion.list[[4]], 
    family = poisson)
znmodel4 <- glm(discussion ~ ckjm_dit + ckjm_npm + ckjm_rfc + ckjm_noc, data = derby.discussion.list[[4]], 
    family = poisson)
znmodel5 <- glm(discussion ~ ckjm_dit + ckjm_ca + ckjm_lcom + ckjm_cbo + ckjm_noc, 
    data = derby.discussion.list[[4]], family = poisson)
znmodel6 <- glm(discussion ~ ckjm_dit + ckjm_wmc + ckjm_ca + ckjm_cbo + ckjm_noc, 
    data = derby.discussion.list[[4]], family = poisson)
znmodel7 <- glm(discussion ~ ckjm_dit + ckjm_npm + ckjm_lcom + ckjm_cbo + ckjm_noc, 
    data = derby.discussion.list[[4]], family = poisson)
summary(znmodel1)
## 
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_ca + raw_loc + ckjm_noc, 
##     family = poisson, data = derby.discussion.list[[4]])
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -5.231  -4.420   0.577   1.645   5.548  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.34e+00   4.22e-02   55.34  < 2e-16 ***
## ckjm_dit     4.47e-02   3.23e-02    1.38    0.166    
## ckjm_ca     -4.43e-03   2.49e-03   -1.78    0.075 .  
## raw_loc      1.59e-04   2.18e-05    7.30  2.9e-13 ***
## ckjm_noc    -9.57e-03   1.49e-02   -0.64    0.520    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1098.1  on 105  degrees of freedom
## Residual deviance: 1017.7  on 101  degrees of freedom
## AIC: 1372
## 
## Number of Fisher Scoring iterations: 6
summary(znmodel2)
## 
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_npm + raw_loc + ckjm_noc, 
##     family = poisson, data = derby.discussion.list[[4]])
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -5.204  -4.478   0.499   1.636   5.610  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.30e+00   4.21e-02   54.55  < 2e-16 ***
## ckjm_dit     6.42e-02   3.91e-02    1.64     0.10    
## ckjm_npm     4.12e-04   1.07e-03    0.39     0.70    
## raw_loc      1.38e-04   3.26e-05    4.24  2.2e-05 ***
## ckjm_noc    -1.37e-02   1.50e-02   -0.91     0.36    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1098.1  on 105  degrees of freedom
## Residual deviance: 1020.8  on 101  degrees of freedom
## AIC: 1376
## 
## Number of Fisher Scoring iterations: 5
summary(znmodel3)
## 
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_rfc + ckjm_ca + ckjm_noc, 
##     family = poisson, data = derby.discussion.list[[4]])
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -5.552  -4.243   0.533   1.640   5.662  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.305351   0.042940   53.69  < 2e-16 ***
## ckjm_dit     0.046529   0.032220    1.44    0.149    
## ckjm_rfc     0.001467   0.000212    6.93  4.3e-12 ***
## ckjm_ca     -0.006462   0.002612   -2.47    0.013 *  
## ckjm_noc    -0.014122   0.015265   -0.93    0.355    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1098.1  on 105  degrees of freedom
## Residual deviance: 1017.7  on 101  degrees of freedom
## AIC: 1372
## 
## Number of Fisher Scoring iterations: 6
summary(znmodel4)
## 
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_npm + ckjm_rfc + ckjm_noc, 
##     family = poisson, data = derby.discussion.list[[4]])
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -5.575  -4.411   0.471   1.699   5.686  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.259906   0.041988   53.82  < 2e-16 ***
## ckjm_dit     0.082820   0.037511    2.21  0.02725 *  
## ckjm_npm     0.000736   0.001054    0.70  0.48527    
## ckjm_rfc     0.001102   0.000287    3.84  0.00012 ***
## ckjm_noc    -0.018940   0.015360   -1.23  0.21754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1098.1  on 105  degrees of freedom
## Residual deviance: 1023.7  on 101  degrees of freedom
## AIC: 1378
## 
## Number of Fisher Scoring iterations: 5
summary(znmodel5)
## 
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_ca + ckjm_lcom + ckjm_cbo + 
##     ckjm_noc, family = poisson, data = derby.discussion.list[[4]])
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -5.807  -4.202   0.681   1.400   5.854  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.49e+00   5.04e-02   49.33  < 2e-16 ***
## ckjm_dit     1.38e-01   3.00e-02    4.61  4.0e-06 ***
## ckjm_ca      2.23e-04   2.53e-03    0.09     0.93    
## ckjm_lcom    4.39e-05   9.76e-06    4.50  6.7e-06 ***
## ckjm_cbo    -7.78e-03   1.79e-03   -4.34  1.4e-05 ***
## ckjm_noc    -6.69e-03   1.46e-02   -0.46     0.65    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1098.1  on 105  degrees of freedom
## Residual deviance: 1038.8  on 100  degrees of freedom
## AIC: 1396
## 
## Number of Fisher Scoring iterations: 6
summary(znmodel6)
## 
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_wmc + ckjm_ca + ckjm_cbo + 
##     ckjm_noc, family = poisson, data = derby.discussion.list[[4]])
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
##  -6.28   -3.89    0.61    1.60    5.23  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.465674   0.048245   51.11  < 2e-16 ***
## ckjm_dit     0.017124   0.031349    0.55    0.585    
## ckjm_wmc     0.006643   0.000462   14.38  < 2e-16 ***
## ckjm_ca     -0.006287   0.002793   -2.25    0.024 *  
## ckjm_cbo    -0.012407   0.001608   -7.72  1.2e-14 ***
## ckjm_noc    -0.018571   0.016443   -1.13    0.259    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1098.13  on 105  degrees of freedom
## Residual deviance:  901.86  on 100  degrees of freedom
## AIC: 1259
## 
## Number of Fisher Scoring iterations: 6

zmodel1 <- zeroinfl(discussion ~ ckjm_dit + ckjm_ca + raw_loc + ckjm_noc, data = derby.discussion.list[[4]])
zmodel2 <- zeroinfl(discussion ~ ckjm_dit + ckjm_npm + raw_loc + ckjm_noc, data = derby.discussion.list[[4]])
zmodel3 <- zeroinfl(discussion ~ ckjm_dit + ckjm_rfc + ckjm_ca + ckjm_noc, data = derby.discussion.list[[4]])
zmodel4 <- zeroinfl(discussion ~ ckjm_dit + ckjm_npm + ckjm_rfc + ckjm_noc, 
    data = derby.discussion.list[[4]])
zmodel5 <- zeroinfl(discussion ~ ckjm_dit + ckjm_ca + ckjm_lcom + ckjm_cbo + 
    ckjm_noc, data = derby.discussion.list[[4]])
## Error: system is computationally singular: reciprocal condition number =
## 5.91079e-20
zmodel6 <- zeroinfl(discussion ~ ckjm_dit + ckjm_wmc + ckjm_ca + ckjm_cbo + 
    ckjm_noc, data = derby.discussion.list[[4]])
zmodel7 <- zeroinfl(discussion ~ ckjm_dit + ckjm_npm + ckjm_lcom + ckjm_cbo + 
    ckjm_noc, data = derby.discussion.list[[4]])
## Error: system is computationally singular: reciprocal condition number =
## 6.44213e-20
summary(zmodel1)
## Warning: NaNs produced
## 
## Call:
## zeroinfl(formula = discussion ~ ckjm_dit + ckjm_ca + raw_loc + ckjm_noc, 
##     data = derby.discussion.list[[4]])
## 
## Pearson residuals:
##    Min     1Q Median     3Q    Max 
## -2.912 -1.040  0.221  0.641  3.507 
## 
## Count model coefficients (poisson with log link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.68e+00   4.19e-02   63.93   <2e-16 ***
## ckjm_dit     1.25e-01   5.11e-03   24.44   <2e-16 ***
## ckjm_ca     -3.48e-03   2.30e-03   -1.51    0.130    
## raw_loc      3.67e-05         NA      NA       NA    
## ckjm_noc     3.78e-02   1.87e-02    2.02    0.044 *  
## 
## Zero-inflation model coefficients (binomial with logit link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.646447   0.383328   -1.69  0.09172 .  
## ckjm_dit     0.450426   0.420617    1.07  0.28423    
## ckjm_ca      0.014622   0.021734    0.67  0.50109    
## raw_loc     -0.001349   0.000389   -3.47  0.00052 ***
## ckjm_noc     0.086919   0.098703    0.88  0.37853    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 18 
## Log-likelihood: -346 on 10 Df
summary(zmodel2)
## Warning: NaNs produced
## 
## Call:
## zeroinfl(formula = discussion ~ ckjm_dit + ckjm_npm + raw_loc + 
##     ckjm_noc, data = derby.discussion.list[[4]])
## 
## Pearson residuals:
##    Min     1Q Median     3Q    Max 
## -2.680 -1.029  0.202  0.712  4.073 
## 
## Count model coefficients (poisson with log link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.63e+00   4.10e-02   64.11  < 2e-16 ***
## ckjm_dit     1.83e-01   2.30e-02    7.96  1.7e-15 ***
## ckjm_npm     2.11e-03   3.19e-04    6.62  3.7e-11 ***
## raw_loc     -2.59e-05         NA      NA       NA    
## ckjm_noc     2.69e-02   1.93e-02    1.40     0.16    
## 
## Zero-inflation model coefficients (binomial with logit link):
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.621536   0.388106   -1.60   0.1093   
## ckjm_dit     0.451425   0.421888    1.07   0.2846   
## ckjm_npm     0.011387   0.012837    0.89   0.3751   
## raw_loc     -0.001577   0.000502   -3.14   0.0017 **
## ckjm_noc     0.084227   0.096575    0.87   0.3831   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 17 
## Log-likelihood: -345 on 10 Df
summary(zmodel3)
## 
## Call:
## zeroinfl(formula = discussion ~ ckjm_dit + ckjm_rfc + ckjm_ca + 
##     ckjm_noc, data = derby.discussion.list[[4]])
## 
## Pearson residuals:
##    Min     1Q Median     3Q    Max 
## -2.751 -1.127  0.239  0.791  3.314 
## 
## Count model coefficients (poisson with log link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.67720    0.04319   61.99  < 2e-16 ***
## ckjm_dit     0.12855    0.03388    3.79  0.00015 ***
## ckjm_rfc     0.00030    0.00023    1.30  0.19201    
## ckjm_ca     -0.00382    0.00267   -1.43  0.15306    
## ckjm_noc     0.03578    0.01942    1.84  0.06540 .  
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.67123    0.38221   -1.76    0.079 .
## ckjm_dit     0.29959    0.37253    0.80    0.421  
## ckjm_rfc    -0.00643    0.00269   -2.39    0.017 *
## ckjm_ca      0.01267    0.02133    0.59    0.553  
## ckjm_noc     0.10202    0.10017    1.02    0.308  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 20 
## Log-likelihood: -348 on 10 Df
summary(zmodel4)
## 
## Call:
## zeroinfl(formula = discussion ~ ckjm_dit + ckjm_npm + ckjm_rfc + 
##     ckjm_noc, data = derby.discussion.list[[4]])
## 
## Pearson residuals:
##    Min     1Q Median     3Q    Max 
## -2.291 -1.140  0.162  0.780  3.690 
## 
## Count model coefficients (poisson with log link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.636200   0.042979   61.34  < 2e-16 ***
## ckjm_dit     0.188912   0.037783    5.00  5.7e-07 ***
## ckjm_npm     0.002331   0.001051    2.22    0.027 *  
## ckjm_rfc    -0.000309   0.000281   -1.10    0.272    
## ckjm_noc     0.028987   0.019706    1.47    0.141    
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.64110    0.38410   -1.67    0.095 .
## ckjm_dit     0.30869    0.38738    0.80    0.426  
## ckjm_npm     0.00722    0.01262    0.57    0.567  
## ckjm_rfc    -0.00718    0.00353   -2.03    0.042 *
## ckjm_noc     0.10482    0.09859    1.06    0.288  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 18 
## Log-likelihood: -346 on 10 Df
summary(zmodel5)
## Error: object 'zmodel5' not found
summary(zmodel6)
## 
## Call:
## zeroinfl(formula = discussion ~ ckjm_dit + ckjm_wmc + ckjm_ca + 
##     ckjm_cbo + ckjm_noc, data = derby.discussion.list[[4]])
## 
## Pearson residuals:
##    Min     1Q Median     3Q    Max 
## -4.266 -0.983  0.169  0.771  3.117 
## 
## Count model coefficients (poisson with log link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.791085   0.047186   59.15  < 2e-16 ***
## ckjm_dit     0.061084   0.031167    1.96     0.05 .  
## ckjm_wmc     0.004278   0.000519    8.25  < 2e-16 ***
## ckjm_ca     -0.003350   0.002862   -1.17     0.24    
## ckjm_cbo    -0.011555   0.001635   -7.07  1.6e-12 ***
## ckjm_noc     0.035429   0.020205    1.75     0.08 .  
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -1.0113     0.4247   -2.38   0.0173 * 
## ckjm_dit      0.5876     0.3857    1.52   0.1277   
## ckjm_wmc     -0.0282     0.0106   -2.67   0.0077 **
## ckjm_ca       0.0232     0.0227    1.02   0.3076   
## ckjm_cbo      0.0173     0.0146    1.18   0.2362   
## ckjm_noc      0.1102     0.1046    1.05   0.2917   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 20 
## Log-likelihood: -309 on 12 Df

vuong(znmodel1, zmodel1)
## Vuong Non-Nested Hypothesis Test-Statistic: -7.134 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## in this case:
## model2 > model1, with p-value 4.878e-13
vuong(znmodel2, zmodel2)
## Vuong Non-Nested Hypothesis Test-Statistic: -7.14 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## in this case:
## model2 > model1, with p-value 4.681e-13
vuong(znmodel3, zmodel3)
## Vuong Non-Nested Hypothesis Test-Statistic: -7.136 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## in this case:
## model2 > model1, with p-value 4.821e-13
vuong(znmodel4, zmodel4)
## Vuong Non-Nested Hypothesis Test-Statistic: -7.168 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## in this case:
## model2 > model1, with p-value 3.81e-13
vuong(znmodel5, zmodel5)
## Error: object 'zmodel5' not found
vuong(znmodel6, zmodel6)
## Vuong Non-Nested Hypothesis Test-Statistic: -7.059 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## in this case:
## model2 > model1, with p-value 8.399e-13

We now analyze each of the models.

summary(model1)
## 
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_ca + raw_loc + ckjm_noc, 
##     family = poisson, data = derby.discussion.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.216  -1.295  -0.906   1.060   4.872  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.77e+00   5.91e-02   29.92  < 2e-16 ***
## ckjm_dit     1.64e-02   4.36e-02    0.38     0.71    
## ckjm_ca      3.60e-03   3.11e-03    1.16     0.25    
## raw_loc      1.67e-04   3.61e-05    4.64  3.6e-06 ***
## ckjm_noc    -3.26e-02   2.15e-02   -1.51     0.13    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 377.90  on 96  degrees of freedom
## Residual deviance: 349.63  on 92  degrees of freedom
## AIC: 701.4
## 
## Number of Fisher Scoring iterations: 5

We can see that in model 1 only the intercept and raw_loc are statistically significant predictors of discussion (p < 0.001).
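Because the Poisson model uses a log link, a coefficient is easier to read after exponentiating it. As a small worked example, the sketch below takes the raw_loc estimate from the summary above and converts it into the multiplicative change in expected discussion for 1000 additional lines of code. The choice of 1000 lines as the unit is mine, for readability only.

```r
# Coefficient for raw_loc as reported in summary(model1)
beta_raw_loc <- 1.67e-04

# On the log scale, 1000 extra lines add 1000 * beta to the linear predictor;
# exponentiating gives the multiplicative change in the expected count
rate_ratio_per_kloc <- exp(1000 * beta_raw_loc)

print(rate_ratio_per_kloc)
```

A value above 1 means that larger files are associated with more expected discussion, holding the other predictors fixed.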

summary(model2)
## 
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_npm + raw_loc + ckjm_noc, 
##     family = poisson, data = derby.discussion.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.914  -1.204  -0.894   0.458   4.814  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.75e+00   5.87e-02   29.82  < 2e-16 ***
## ckjm_dit     2.49e-02   4.44e-02    0.56  0.57432    
## ckjm_npm     3.68e-03   1.09e-03    3.38  0.00073 ***
## raw_loc      9.76e-05   4.52e-05    2.16  0.03077 *  
## ckjm_noc    -3.21e-02   2.07e-02   -1.55  0.12056    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 377.90  on 96  degrees of freedom
## Residual deviance: 339.85  on 92  degrees of freedom
## AIC: 691.6
## 
## Number of Fisher Scoring iterations: 5

The intercept and npm are statistically significant at p < 0.001, and raw_loc at p < 0.05.

summary(model3)
## 
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_rfc + ckjm_ca + ckjm_noc, 
##     family = poisson, data = derby.discussion.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.189  -1.253  -0.826   0.814   4.754  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.695639   0.062733   27.03  < 2e-16 ***
## ckjm_dit    -0.001297   0.043907   -0.03     0.98    
## ckjm_rfc     0.001906   0.000318    5.99  2.1e-09 ***
## ckjm_ca      0.000325   0.003306    0.10     0.92    
## ckjm_noc    -0.025712   0.022172   -1.16     0.25    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 377.90  on 96  degrees of freedom
## Residual deviance: 335.77  on 92  degrees of freedom
## AIC: 687.5
## 
## Number of Fisher Scoring iterations: 5

Rfc and the intercept are statistically significant at p < 0.001.

summary(model4)
## 
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_npm + ckjm_rfc + ckjm_noc, 
##     family = poisson, data = derby.discussion.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.076  -1.312  -0.828   0.403   4.652  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.699445   0.062321   27.27  < 2e-16 ***
## ckjm_dit     0.007855   0.044602    0.18  0.86021    
## ckjm_npm     0.001890   0.001285    1.47  0.14128    
## ckjm_rfc     0.001471   0.000434    3.39  0.00069 ***
## ckjm_noc    -0.029658   0.020574   -1.44  0.14942    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 377.90  on 96  degrees of freedom
## Residual deviance: 333.61  on 92  degrees of freedom
## AIC: 685.4
## 
## Number of Fisher Scoring iterations: 5

Rfc and the intercept are statistically significant at p < 0.001.

summary(model5)
## 
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_ca + ckjm_lcom + ckjm_cbo + 
##     ckjm_noc, family = poisson, data = derby.discussion.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
##  -3.29   -1.38   -1.02    1.14    5.03  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.86e+00   7.34e-02   25.38   <2e-16 ***
## ckjm_dit     1.55e-02   4.63e-02    0.34    0.737    
## ckjm_ca      4.37e-03   3.59e-03    1.22    0.224    
## ckjm_lcom    5.37e-05   2.27e-05    2.36    0.018 *  
## ckjm_cbo    -2.87e-04   3.04e-03   -0.09    0.925    
## ckjm_noc    -4.18e-02   2.21e-02   -1.89    0.058 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 377.90  on 96  degrees of freedom
## Residual deviance: 362.75  on 91  degrees of freedom
## AIC: 716.5
## 
## Number of Fisher Scoring iterations: 5

The intercept is statistically significant at p < 0.001 and lcom at p < 0.05; noc is marginal at p = 0.058.

summary(model6)
## 
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_wmc + ckjm_ca + ckjm_cbo + 
##     ckjm_noc, family = poisson, data = derby.discussion.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.842  -1.310  -0.891   0.488   4.898  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.796151   0.073749   24.35  < 2e-16 ***
## ckjm_dit    -0.014735   0.047375   -0.31    0.756    
## ckjm_wmc     0.003704   0.000612    6.05  1.4e-09 ***
## ckjm_ca      0.002774   0.003688    0.75    0.452    
## ckjm_cbo    -0.001730   0.002880   -0.60    0.548    
## ckjm_noc    -0.042953   0.022904   -1.88    0.061 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 377.90  on 96  degrees of freedom
## Residual deviance: 336.88  on 91  degrees of freedom
## AIC: 690.6
## 
## Number of Fisher Scoring iterations: 5

Wmc and the intercept are statistically significant at p < 0.001.

summary(model7)
## 
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_npm + ckjm_lcom + 
##     ckjm_cbo + ckjm_noc, family = poisson, data = derby.discussion.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.988  -1.318  -0.908   0.933   4.654  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.76e+00   7.82e-02   22.53   <2e-16 ***
## ckjm_dit     4.87e-02   4.81e-02    1.01    0.311    
## ckjm_npm     5.28e-03   1.11e-03    4.76    2e-06 ***
## ckjm_lcom   -1.09e-05   2.86e-05   -0.38    0.702    
## ckjm_cbo     1.38e-03   2.76e-03    0.50    0.619    
## ckjm_noc    -3.65e-02   2.07e-02   -1.76    0.079 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 377.90  on 96  degrees of freedom
## Residual deviance: 343.88  on 91  degrees of freedom
## AIC: 697.6
## 
## Number of Fisher Scoring iterations: 5

Npm and the intercept are statistically significant at p < 0.001.

Concretely, keeping only the significant predictors, the 7 models reduce to the following 6 models:

Model 1: [rawloc]
Model 2: [npm, rawloc]
Model 3: [rfc]
Model 4: [lcom]
Model 5: [wmc]
Model 6: [npm]

fmodel1 <- glm(discussion ~ raw_loc, data = derby.discussion.train, family = poisson)
fmodel2 <- glm(discussion ~ ckjm_npm + raw_loc, data = derby.discussion.train, 
    family = poisson)
fmodel3 <- glm(discussion ~ ckjm_rfc, data = derby.discussion.train, family = poisson)
fmodel4 <- glm(discussion ~ ckjm_lcom, data = derby.discussion.train, family = poisson)
fmodel5 <- glm(discussion ~ ckjm_wmc, data = derby.discussion.train, family = poisson)
fmodel6 <- glm(discussion ~ ckjm_npm, data = derby.discussion.train, family = poisson)

fmodel1
## 
## Call:  glm(formula = discussion ~ raw_loc, family = poisson, data = derby.discussion.train)
## 
## Coefficients:
## (Intercept)      raw_loc  
##    1.769612     0.000182  
## 
## Degrees of Freedom: 96 Total (i.e. Null);  95 Residual
## Null Deviance:       378 
## Residual Deviance: 353   AIC: 698
fmodel2
## 
## Call:  glm(formula = discussion ~ ckjm_npm + raw_loc, family = poisson, 
##     data = derby.discussion.train)
## 
## Coefficients:
## (Intercept)     ckjm_npm      raw_loc  
##    1.732417     0.003338     0.000113  
## 
## Degrees of Freedom: 96 Total (i.e. Null);  94 Residual
## Null Deviance:       378 
## Residual Deviance: 343   AIC: 691
fmodel3
## 
## Call:  glm(formula = discussion ~ ckjm_rfc, family = poisson, data = derby.discussion.train)
## 
## Coefficients:
## (Intercept)     ckjm_rfc  
##     1.67257      0.00193  
## 
## Degrees of Freedom: 96 Total (i.e. Null);  95 Residual
## Null Deviance:       378 
## Residual Deviance: 338   AIC: 683
fmodel4
## 
## Call:  glm(formula = discussion ~ ckjm_lcom, family = poisson, data = derby.discussion.train)
## 
## Coefficients:
## (Intercept)    ckjm_lcom  
##    1.86e+00     6.24e-05  
## 
## Degrees of Freedom: 96 Total (i.e. Null);  95 Residual
## Null Deviance:       378 
## Residual Deviance: 367   AIC: 713
fmodel5
## 
## Call:  glm(formula = discussion ~ ckjm_wmc, family = poisson, data = derby.discussion.train)
## 
## Coefficients:
## (Intercept)     ckjm_wmc  
##     1.74622      0.00363  
## 
## Degrees of Freedom: 96 Total (i.e. Null);  95 Residual
## Null Deviance:       378 
## Residual Deviance: 341   AIC: 687
fmodel6
## 
## Call:  glm(formula = discussion ~ ckjm_npm, family = poisson, data = derby.discussion.train)
## 
## Coefficients:
## (Intercept)     ckjm_npm  
##       1.781        0.005  
## 
## Degrees of Freedom: 96 Total (i.e. Null);  95 Residual
## Null Deviance:       378 
## Residual Deviance: 349   AIC: 695

Furthermore, the AIC values of the 7 models are as follows:

c(model1$aic, model2$aic, model3$aic, model4$aic, model5$aic, model6$aic, model7$aic)
## [1] 701.4 691.6 687.5 685.4 716.5 690.6 697.6
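To make the ranking explicit, the AIC values printed above can be named and sorted. This is only a convenience sketch; the vector is typed in by hand from the output above, and the names model1 to model7 follow the order of the call.

```r
# AIC values copied from the output above, in the order model1 ... model7
aic <- c(model1 = 701.4, model2 = 691.6, model3 = 687.5, model4 = 685.4,
         model5 = 716.5, model6 = 690.6, model7 = 697.6)

# Model with the lowest AIC
best <- names(which.min(aic))

# Difference of each model from the best one, smallest first
delta <- sort(aic - min(aic))

print(best)
print(delta)
```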

According to AIC, the best of the 7 models is model 4 (AIC 685.4), whose only significant predictor is rfc. We now test the 6 reduced Poisson models. We test their generalization using error functions from the Metrics package on our Derby test dataset, which was randomly selected to be the fifth element of derby.discussion.list. See the package documentation for details of the function implementation.

suppressWarnings(suppressMessages(library("Metrics")))
derby.discussion.test = derby.discussion.list[[5]]
fmodel1.e = rmse(predict(fmodel1, data.frame(raw_loc = derby.discussion.test$raw_loc)), 
    derby.discussion.test$discussion)
fmodel2.e = rmse(predict(fmodel2, data.frame(ckjm_npm = derby.discussion.test$ckjm_npm, 
    raw_loc = derby.discussion.test$raw_loc)), derby.discussion.test$discussion)
fmodel3.e = rmse(predict(fmodel3, data.frame(ckjm_rfc = derby.discussion.test$ckjm_rfc)), 
    derby.discussion.test$discussion)
fmodel4.e = rmse(predict(fmodel4, data.frame(ckjm_lcom = derby.discussion.test$ckjm_lcom)), 
    derby.discussion.test$discussion)
fmodel5.e = rmse(predict(fmodel5, data.frame(ckjm_wmc = derby.discussion.test$ckjm_wmc)), 
    derby.discussion.test$discussion)
fmodel6.e = rmse(predict(fmodel6, data.frame(ckjm_npm = derby.discussion.test$ckjm_npm)), 
    derby.discussion.test$discussion)

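The RMSE of the 6 models is given as follows (see this for a reference of RMSE and other error measures). Note that predict() on a glm object defaults to type = "link", so the predictions above are on the log scale while discussion is a raw count, which likely explains why the values are nearly identical across models: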

c(fmodel1.e, fmodel2.e, fmodel3.e, fmodel4.e, fmodel5.e, fmodel6.e)
## [1] 7.199 7.203 7.186 7.172 7.185 7.210
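For a Poisson glm, predict() returns values on the log (link) scale unless type = "response" is given, and the scale matters when the predictions are compared with raw counts. The sketch below illustrates the difference on simulated data. The data frames train and test, the variables x and y, and the helper rmse_manual are hypothetical names; the helper is hand-written so that the example needs no extra package.

```r
set.seed(7)

# Simulated training and test sets (illustrative only)
train <- data.frame(x = runif(200, 0, 2))
train$y <- rpois(200, lambda = exp(1 + 0.8 * train$x))
test <- data.frame(x = runif(100, 0, 2))
test$y <- rpois(100, lambda = exp(1 + 0.8 * test$x))

fit <- glm(y ~ x, data = train, family = poisson)

# Root mean squared error, written out by hand
rmse_manual <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Default predictions are on the link (log) scale
pred_link <- predict(fit, newdata = test)
# type = "response" returns expected counts, comparable with test$y
pred_resp <- predict(fit, newdata = test, type = "response")

rmse_link <- rmse_manual(test$y, pred_link)
rmse_resp <- rmse_manual(test$y, pred_resp)

print(c(link_scale = rmse_link, response_scale = rmse_resp))
```

Only the response-scale error is in the same units as the counts, so it is the one to use when comparing models on a test set.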

Plot all scatterplots

plot(derby.discussion.list[[1]])

plot of chunk unnamed-chunk-30

plot(derby.discussion.list[[2]])

plot of chunk unnamed-chunk-30

plot(derby.discussion.list[[3]])

plot of chunk unnamed-chunk-30

plot(derby.discussion.list[[4]])

plot of chunk unnamed-chunk-30

plot(derby.discussion.list[[5]])

plot of chunk unnamed-chunk-30

plot(derby.discussion.list[[6]])

plot of chunk unnamed-chunk-30


plot(lucene.discussion.list[[1]])

plot of chunk unnamed-chunk-30


plot(pdfbox.discussion.list[[1]])

plot of chunk unnamed-chunk-30

plot(pdfbox.discussion.list[[2]])

plot of chunk unnamed-chunk-30

plot(pdfbox.discussion.list[[3]])

plot of chunk unnamed-chunk-30

plot(pdfbox.discussion.list[[4]])

plot of chunk unnamed-chunk-30


plot(ivy.discussion.list[[1]])

plot of chunk unnamed-chunk-30

plot(ivy.discussion.list[[2]])

plot of chunk unnamed-chunk-30

Since most models have only a single variable, let's plot them:

# Sort by raw_loc so lines() draws the fitted curve from left to right
d2 = derby.discussion.list[[2]]
d2 = d2[order(d2$raw_loc), ]
plot(d2$raw_loc, d2$discussion, pch = 19, col = "darkgrey", xlab = "Raw LOC", 
    ylab = "Discussion")
lines(d2$raw_loc, predict(fmodel1, data.frame(raw_loc = d2$raw_loc), type = "response"), 
    col = "red", lwd = 3)

plot of chunk unnamed-chunk-31


predict(fmodel1, data.frame(raw_loc = derby.discussion.list[[2]]$raw_loc), type = "response")
##      1      2      3      4      5      6      7      8      9     10 
##  6.081  5.909  7.190  5.910  7.275  7.275  6.827  7.663  6.366  6.186 
##     11     12     13     14     15     16     17     18     19     20 
##  5.905  5.922  7.538  7.538 10.147 10.147 10.147 10.147 10.147 10.147 
##     21     22     23     24     25     26     27     28     29     30 
## 10.147  5.989  5.885  6.068  7.844  7.844  7.844  6.152  5.998  6.557 
##     31     32     33     34     35     36     37     38     39     40 
##  7.217  5.921  5.889 13.639 13.639 13.639 13.639 13.639  6.175  6.952 
##     41     42     43     44     45     46     47     48     49     50 
##  6.751  6.522  6.522  5.900  6.050  5.986  6.266  5.926  6.113  5.905 
##     51     52     53     54     55     56     57     58     59     60 
##  5.905  6.071  6.075  5.924  6.268  6.005  6.503  6.425  7.014  6.597 
##     61     62     63     64     65     66     67     68     69     70 
##  7.664  7.664  7.664  6.559  6.559  6.160  5.983  6.477  6.260  7.466 
##     71     72     73     74     75     76     77     78     79     80 
##  6.051  6.532  6.532  6.285  6.179  6.179  6.179 15.681 15.681  6.285 
##     81     82     83     84     85     86     87     88     89     90 
##  6.084  6.043  6.090  6.386  6.162  6.518  9.241  6.826  6.237  6.359 
##     91     92     93     94     95     96     97     98     99    100 
##  7.146  8.820  6.795  6.795  8.795  7.032  5.994  6.156  6.452  6.452 
##    101    102    103    104    105    106    107    108    109    110 
##  7.825  6.082  6.078  6.006  6.052  5.973  5.924  5.936  6.919  5.989 
##    111    112    113    114    115    116    117    118    119    120 
##  5.989  5.913  6.002  6.469  6.469  6.219  6.089  6.540  6.093  6.905 
##    121    122    123    124    125    126    127    128    129    130 
##  7.629  6.255  7.919  6.602  6.602  6.577  6.448  6.491  6.416  5.917 
##    131    132    133    134    135    136    137    138    139    140 
##  6.246  5.927  5.960  7.159  7.159  7.159  6.379  8.222  8.222  6.251 
##    141    142    143    144    145    146    147    148    149    150 
##  5.903  6.523  6.476  6.299 10.688  7.391  6.095  7.349  6.099  6.099 
##    151    152    153    154    155    156    157    158    159    160 
##  6.585  5.901  6.483  7.256  6.485  6.540  5.894  5.894  6.366  6.366 
##    161    162 
##  6.484  6.554
derby.discussion.list[[2]]$discussion
##   [1]  4 20 11 20  4 15  2  4  4  4  4 15  6 15 14  9  5  6  3 15  4  4  4
##  [24]  4  4  6 15  4  4  6  4  2  2  4  7 11  3 46 46 46 15  4  1  1  3  3
##  [47]  2 13 13  6  5 21 21  1  4  1  8  1  4  3 14  6  5  6  5  3  3  3  4
##  [70] 13 13  6  5  1  4  1  6 21  1 21 21 21  4  1  7  3  4 21 21 21  4 12
##  [93]  4 12 21 12 21 12 21  3 21 21 21 21 21 21 21 21 21  6  5 21 21  4  3
## [116] 21 21  1  1  1  7  2  3  5  7  4  3  1  8 15 15 20 20  6 15 20 20 15
## [139] 20 15 15  5 46 12 12  8 12 15  2  4  2  1 46  4  4  4  3  3 15  3 15
## [162] 37

My conclusion is that we cannot establish a relation between the structural complexity metrics and the discussion effort estimator. This follows mostly from the 6 scatterplot matrices, which show how the discussion estimator behaves against each of the file metrics. Furthermore, the statistical models do not suggest that any combination of the file metrics would help establish a relation between structural complexity and discussion effort. Lastly, this might be due to the way we distribute issue discussion across the files. Concretely, since we repeat the value for each file that was submitted in a patch, we will see many repeated values of discussion for the same file metric. Any pre-existing relation between the amount of discussion and the associated file may thus be influencing the observed relation.
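The duplication described above can be illustrated with a small sketch. The data below are made up, and attaching one issue-level discussion count to every file touched by that issue is an assumption about how the rows were built, based only on the description in this paragraph. The data frame names `issues`, `files` and `rows` are hypothetical.

```r
# Hypothetical issue-level effort: one discussion count per issue
issues <- data.frame(issue_code = c("A-1", "A-2"),
                     discussion = c(21, 4))

# Hypothetical file-level metrics: several files changed per issue
files <- data.frame(issue_code = c("A-1", "A-1", "A-1", "A-2"),
                    file = c("X.java", "Y.java", "Z.java", "X.java"),
                    raw_loc = c(120, 950, 40, 120))

# Joining repeats each issue's discussion value on every file row
rows <- merge(files, issues, by = "issue_code")
print(rows)

# Count how many rows share each discussion value
print(table(rows$discussion))
```

In this toy case the value 21 is paired with files of 40, 120 and 950 lines, so the same effort value lands on very different metric values. This is the same pattern as the long runs of repeated values (such as 21 and 46) in the discussion vector printed above.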

Actions Effort Estimator Analysis

# Create a list of dataframes (tables) where each dataframe contains
# datapoints of a given release for each project.
derby.actions.list = split(derby.discussion, factor(derby$release))
lucene.actions.list = split(lucene.discussion, factor(lucene$release))
pdfbox.actions.list = split(pdfbox.discussion, factor(pdfbox$release))
ivy.actions.list = split(ivy.discussion, factor(ivy$release))

Let's see all plots for actions:

plot(derby.actions.list[[1]])

plot of chunk unnamed-chunk-33